<a href="https://colab.research.google.com/github/DavidSenseman/BIO1173/blob/main/Class_01_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---------------------------
**COPYRIGHT NOTICE:** This Jupyterlab Notebook is a Derivative work of [Jeff Heaton](https://github.com/jeffheaton) licensed under the Apache License, Version 2.0 (the "License"); You may not use this file except in compliance with the License. You may obtain a copy of the License at

> [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

------------------------

# **BIO 1173: Intro Computational Biology**

## **Class_01_3: Lists, Dictionaries, Sets and JSON**

##### **Module I: Getting Started with Python**

* Instructor: [David Senseman](mailto:David.Senseman@utsa.edu), [Department of Biology, Health and the Environment](https://sciences.utsa.edu/bhe/), [UTSA](https://www.utsa.edu/)

### Module 1 Material

* Part 1.1: Python Basics 1 -- Introduction to Google Colab
* Part 1.2: Python Basics 2 -- Strings, Variables, Functions
* **Part 1.3: Python Basics 3 -- Lists, Dictionaries, Sets and JSON**
* Part 1.4: Python Basics 4 -- Conditionals and Loops
* Part 1.5: Python Basics 5 -- Packages, NumPy arrays and Matplotlib
* Part 1.6: Python Basics 6 -- Pandas and File Handling

## Google Colab Instructions

You MUST run the following code cell to get credit for this class lesson. By running this code cell, you will map your GDrive to ```/content/drive``` and print out your Google GMAIL address. Your Instructor will use your GMAIL address to verify the author of this class lesson.

In [None]:
# You must run this cell first
try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    from google.colab import auth
    auth.authenticate_user()
    Colab = True
    print("Note: Using Google Colab")
    import requests
    gcloud_token = !gcloud auth print-access-token
    gcloud_tokeninfo = requests.get('https://www.googleapis.com/oauth2/v3/tokeninfo?access_token=' + gcloud_token[0]).json()
    print(gcloud_tokeninfo['email'])
except:
    print("**WARNING**: Your GMAIL address was not printed out in the output below.")
    print("**WARNING**: You will NOT receive credit for this lesson.")
    Colab = False

If the code is correct, you should see something similar to the following output, but your GMAIL address will be printed out.
~~~text
Mounted at /content/drive
Note: Using Google Colab
david.senseman@gmail.com
~~~

If your GMAIL address is not visible, your submission will not be graded.


# **Python Basics 3 -- Lists, Dictionaries, Sets and JSON**

Python includes **Lists**, **Sets**, **Dictionaries**, and other data structures as built-in types. The syntax of lists and dictionaries resembles **JSON**, which is discussed later in this lesson.

This course will focus primarily on Lists, Sets, and Dictionaries, with Tuples as a close relative of Lists. It is important to understand the differences between these four fundamental collection types.

* **List** - A list is a mutable ordered collection that allows duplicate elements.
* **Tuple** - A tuple is an immutable ordered collection that allows duplicate elements.
* **Dictionary** - A dictionary is a mutable collection of key/value pairs that Python indexes by key. Since Python 3.7, a dictionary preserves the order in which its keys were inserted.
* **Set** - A set is a mutable unordered collection with no duplicate elements.

Most Python collections are mutable, meaning the program can add and remove elements after definition. One notable exception is the Python **tuple**, an immutable collection: items cannot be added, removed, or replaced after it is defined.

It is also essential to understand that an ordered collection means that items maintain their order as the program adds them to a collection. However, this order might not be any specific ordering, such as alphabetic or numeric.
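A minimal sketch of what "ordered" means here, using a made-up list of tissue names (the variable names `tissues` and `alphabetical` are only for illustration). The list keeps the order in which items were entered; sorting is a separate step that produces a new list.

```python
# Hypothetical example data: tissue names in the order they were entered
tissues = ['liver', 'brain', 'heart']

# A list remembers insertion order; it does not sort itself
print(tissues)        # ['liver', 'brain', 'heart']

# sorted() returns a new, alphabetized list and leaves the original alone
alphabetical = sorted(tissues)
print(alphabetical)   # ['brain', 'heart', 'liver']
print(tissues)        # still ['liver', 'brain', 'heart']
```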

Lists and tuples are very similar in Python and are often confused. The significant difference is that a list is mutable, but a tuple isn’t. So, we typically use a list to hold a collection of similar items that may grow or shrink, and a tuple when we know ahead of time what information goes into it.

## **Lists and Tuples**

For a Python programmer, lists and tuples look very similar. Both lists and tuples hold an ordered collection of items. It is possible to get by as a programmer using only lists and ignoring tuples.

The primary difference is that a list is enclosed by square brackets `[ ]`, while a tuple is enclosed by parentheses `( )`.

The examples below define both a list and a tuple. They also illustrate that Python indexes lists starting at element `0`, so index `1` refers to the second element in the collection. One advantage of `tuples` over `lists` is that `tuples` are generally slightly faster to iterate over than `lists`.
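As a quick sketch of zero-based indexing, the snippet below reads items from a list and a tuple by position. The data and variable names (`codons`, `stop_codons`) are made up for illustration and are not used elsewhere in this lesson.

```python
# Hypothetical example data
codons = ['AUG', 'UUU', 'GGC']        # a list uses square brackets
stop_codons = ('UAA', 'UAG', 'UGA')   # a tuple uses parentheses

first_codon = codons[0]        # index 0 is the first element
second_stop = stop_codons[1]   # index 1 is the second element
last_codon = codons[-1]        # a negative index counts from the end

print(first_codon, second_stop, last_codon)   # AUG UAG GGC
```

Reading by index works the same way for both types; only writing to an index differs between them.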

Play this YouTube video to see a visual description of Python **lists**.

In [None]:
from IPython.display import HTML
video_id = "spjE6cmV1Cs?"
HTML(f"""
<iframe width="560" height="315"
  src="https://www.youtube.com/embed/{video_id}"
  title="YouTube video player"
  frameborder="0"
  allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
  allowfullscreen
  referrerpolicy="strict-origin-when-cross-origin">
</iframe>
""")

### Example 1:  Create a list called `myList`

The code in the cell below uses square brackets `[ ]` to create a list called `myList`. (Note: `list` is the name of a Python built-in type. Using it as a variable name would hide the built-in, so add something to it.)

In [None]:
# Example 1: Create a list

# Use square brackets to create a list
myList = ['a', 'b', 'c', 'd']

# Print output
print(myList)

If the code is correct, you should see the following output:

~~~text
['a', 'b', 'c', 'd']
~~~

The square brackets tell you that this is a Python `list`.

### **Exercise 1: Create a tuple called `myTuple`**

In the cell below use parentheses `( )` to create a tuple called `myTuple` with the letters `A`, `B`, `C` and `D`. (Note: `tuple` is the name of a Python built-in type, so don't use it as a variable name without adding something to it.)

In [None]:
# Insert your code for Exercise 1 here



If the code is correct, you should see the following output:
~~~text
('A', 'B', 'C', 'D')
~~~

The parentheses tell you that this is a Python `tuple`.

### Example 2: Change the contents of `myList`

As mentioned above, a `list` is mutable, which means its contents can be changed after it has been created.

The code in the cell below demonstrates that the program can change a list. This example uses square bracket `[ ]` indexing to change the second element in `myList`. (Remember: Python starts counting sequences at 0.)


In [None]:
# Example 2: Change the second element of myList

# Change the second element to Z
myList[1] = 'Z'

# Print output
print(myList)

If the code is correct, you should see the following output:
~~~text
['a', 'Z', 'c', 'd']
~~~

This demonstrates that a program can change the contents of a `list` after it has been created.

### **Exercise 2: Change the contents of `myTuple`**

As mentioned above, a `tuple` is **immutable**, which means its contents cannot be changed after it has been created.

Using Example 2 as a template, write the code in the cell below to change the second element in `myTuple` to letter Z.

In [None]:
# Insert your code for Exercise 2 here



If your code is correct, you should see the following error message:

![___](https://biologicslab.co/BIO1173/images/class_01/class_01_3_image01E.png)

As expected, Python will generate an error if you try to change the contents of a `tuple` after it has been created. In other words, a `tuple` is immutable.

P.S. It's OK to turn in your lesson with the error message since the error was _intentional_.


## **Difference Between List and Tuple**

To recap, lists and tuples look very similar to a Python programmer: both hold an ordered collection of items.

The primary difference that you will see syntactically is that a list is enclosed by square brackets `[ ]`, and a tuple is enclosed by parentheses `( )`.

In [None]:
# Run this example

myList = ['a', 'b', 'c', 'd']
myTuple = ('a', 'b', 'c', 'd')

print(f"This is a Python list: {myList}")
print(f"This is a Python tuple: {myTuple}")


If the code is correct you should see the following output:

~~~text
This is a Python list: ['a', 'b', 'c', 'd']
This is a Python tuple: ('a', 'b', 'c', 'd')
~~~


Watch this YouTube video to see the difference between Python **lists** and **tuples**.

In [None]:
from IPython.display import HTML
video_id = "11WrzU81q68?"
HTML(f"""
<iframe width="560" height="315"
  src="https://www.youtube.com/embed/{video_id}"
  title="YouTube video player"
  frameborder="0"
  allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
  allowfullscreen
  referrerpolicy="strict-origin-when-cross-origin">
</iframe>
""")

## **Python Dictionaries**

A Python **dictionary** or **`dict`** is a mutable collection of key/value pairs. (Since Python 3.7, a dictionary also preserves the order in which its keys were inserted.) A mutable data type in Python is a type of object whose value can be changed after it is created. It means that you can modify, add, or remove elements within the object without creating a new instance of it.

For example, lists (`list`) and dictionaries (`dict`) in Python are mutable data types. You can change the elements of a list or update the key-value pairs of a dictionary without creating a new list or dictionary. An important aspect of mutability is that if you have multiple references to the same mutable object, any modifications made to the object will be reflected in all the references.

In contrast, immutable data types like strings (`str`), tuples (`tuple`), and numbers (`int`, `float`) cannot be changed after they are created. If you want to modify these types, you need to create a new instance with the desired changes.
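The difference between mutable and immutable types can be sketched as follows. The variable names (`bases`, `alias`, `seq`, `new_seq`) are made up for illustration. The first half shows two names referring to one list object; the second half shows that a string method returns a new string instead of changing the original.

```python
# Two names that refer to the same list object
bases = ['A', 'C', 'G']
alias = bases              # no copy is made here
alias[0] = 'T'             # change the list through the second name
print(bases)               # ['T', 'C', 'G'] -- the change shows up under both names
print(alias is bases)      # True -- both names point at one object

# Strings are immutable: replace() builds a new string
seq = 'ACG'
new_seq = seq.replace('A', 'T')
print(seq, new_seq)        # ACG TCG -- the original string is untouched
```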

Like other collection types, `dict` can be called with a collection argument to create a dictionary with the elements of the argument. However, those elements must be tuples or lists of two elements — a key and a value. Dictionaries are enclosed within curly braces `{ }`, and each item is separated by a comma. The key-value pairs in a dictionary are separated by a colon `:`.

Watch this YouTube video to see a visual introduction to Python **dictionaries**.

In [None]:
from IPython.display import HTML
video_id = "4t10v2QmTHU?"
HTML(f"""
<iframe width="560" height="315"
  src="https://www.youtube.com/embed/{video_id}"
  title="YouTube video player"
  frameborder="0"
  allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
  allowfullscreen
  referrerpolicy="strict-origin-when-cross-origin">
</iframe>
""")

### Example 3: Create a dict called `DNABaseDict`

The code in the cell below uses Python's `dict()` function to create a dictionary called `DNABaseDict`.

The dictionary's keys are the single-letter abbreviations of the four bases in DNA. Each key has an associated value which is the name of the base. Each key/value pair in this example is written as a two-element list, defined using square brackets `[  ]`.

In [None]:
# Example 3: Create a dictionary using dict()

# create dict using dict() function
DNABaseDict = dict((['A', 'adenine'],
                    ['C', 'cytosine'],
                    ['G', 'guanine'],
                    ['T', 'thymine']
                   ))

# Print out the dictionary
DNABaseDict

If the code is correct, you should see the following output:

~~~text
{'A': 'adenine', 'C': 'cytosine', 'G': 'guanine', 'T': 'thymine'}
~~~

The curly braces `{ }` tell you that this is a Python dictionary. This particular dictionary contains 4 `key/value` pairs with a colon `:` separating each `key` from its corresponding `value`.

### **Exercise 3: Create a dict called `RNABaseDict`**

In the cell below use Python's `dict()` function to create a dictionary called `RNABaseDict`. The dictionary's `keys` should be the single-letter abbreviations of the four RNA nitrogenous bases with the corresponding `value` being the name of the base. Print out the RNABaseDict.

You should already know the four nitrogenous bases in RNA. If you don't, you can "Google" it.

In [None]:
# Insert your code for Exercise 3 here



If the code is correct, you should see the following output:

~~~text
{'A': 'adenine', 'C': 'cytosine', 'G': 'guanine', 'U': 'uracil'}
~~~

----------------------------

### **Why the curly braces `{ }`?**

Even though parentheses `( )` were used to define the `DNABaseDict` dictionary in Example 3, Python printed out this dictionary using curly braces `{ }`.

> **So why the curly braces?**

Because they are so frequently used, Python provides a notation for dictionaries that is similar to `sets`: a comma-separated list of `key/value` pairs enclosed in curly braces `{ }`. Within each pair, a colon `:` separates the `key` from its `value`. When Python prints out a dictionary, it uses the curly brace format.


----------------------------

### Example 4: Create `dnaBaseDict` using curly braces `{ }`

As explained above, you can also create a Python dictionary using curly braces `{ }`. The cell below shows how to create a dictionary called `dnaBaseDict` using curly braces. This dictionary is similar to `DNABaseDict` created in Example 3 except that lower case letters are used for the keys.

In [None]:
# Example 4: Create dictionary using {}

# Create dictionary using curly braces
dnaBaseDict = {'a': 'adenine', 't': 'thymine', 'g': 'guanine',  'c': 'cytosine'}

# Print out the dictionary
dnaBaseDict

If the code is correct, you should see the following output:

~~~text
{'a': 'adenine', 't': 'thymine', 'g': 'guanine', 'c': 'cytosine'}
~~~

### **Exercise 4: Create `rnaBaseDict` using curly braces `{ }`**

In the cell below create a dictionary called `rnaBaseDict` using curly braces. Use lower case letters for the `keys`.

In [None]:
# Insert your code for Exercise 4 here



If the code is correct, you should see the following output but perhaps in a different order:

~~~text
{'a': 'adenine', 'c': 'cytosine', 'g': 'guanine', 'u': 'uracil'}
~~~

-------------------------------

### **Unique Keys**

The keys of a mapping must be **unique** within the collection, because the dictionary has no way to distinguish different values indexed by the same key.
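A small sketch of what key uniqueness means in practice, using a made-up dictionary called `base_names`. If the same key is written twice, Python does not raise an error; the later value simply replaces the earlier one.

```python
# The key 'A' appears twice; the second value replaces the first
base_names = {'A': 'adenine', 'G': 'guanine', 'A': 'alanine'}
print(base_names)          # {'A': 'alanine', 'G': 'guanine'}
print(len(base_names))     # 2 -- only two keys survive

# Assigning to an existing key also overwrites its value
base_names['A'] = 'adenine'
print(base_names['A'])     # adenine
```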

--------------------------------

## **Python Sets**

A Python **set** is an _unordered_  collection of items that contains no duplicates. As we will see, if you try to add an item that is already in a set, nothing happens.

Since strings behave as collections, a string can be used as the argument for a call to set. The resulting set will contain a **single-character string** for each unique character that appears in the argument. The order in which the elements of a set are printed will not necessarily bear any relation to the order in which they were added.
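As a brief illustration of calling `set()` on a string, the sketch below uses a made-up sequence (`GATTACA`). Because the printed order of a set can vary, the last line uses `sorted()` to get a predictable list.

```python
# Hypothetical example sequence
sequence = 'GATTACA'

# set() keeps one single-character string per unique letter
unique_bases = set(sequence)
print(unique_bases)            # order may vary from run to run
print(len(unique_bases))       # 4

# sorted() returns the elements as an alphabetized list
print(sorted(unique_bases))    # ['A', 'C', 'G', 'T']
```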

Watch this YouTube video to see the difference between Python **lists**, **tuples** and **sets**.

In [None]:
from IPython.display import HTML
video_id = "11WrzU81q68?"
HTML(f"""
<iframe width="560" height="315"
  src="https://www.youtube.com/embed/{video_id}"
  title="YouTube video player"
  frameborder="0"
  allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
  allowfullscreen
  referrerpolicy="strict-origin-when-cross-origin">
</iframe>
""")

### Example 5: Create a Set called `DNABases_set`

The code below shows how to create a set of single-character strings called `DNABases_set` using curly braces `{ }`.

In [None]:
# Example 5: Create set

# Create a set called DNABases_set
DNABases_set = {'T', 'C', 'A', 'G'}

# Print DNABases_set
print(DNABases_set)

If the code is correct, you should see the following output but perhaps in a different order:

~~~text
{'T', 'A', 'C', 'G'}
~~~
You should notice two things. First our `set` is _not_ the entire string `TCAG`, but a collection of 4 single-character strings ("letters"), `A`, `C`, `G` and `T`.

Second, the order of the "letters" when we created the set (`TCAG`) may, or may not, be preserved.

Also, a set is an _unordered_ collection, so its items have no positions. If you tried to use square brackets `[ ]` to index an item in a set, you would get the error message: `TypeError: 'set' object is not subscriptable`.
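The indexing error can be seen safely by wrapping the attempt in `try`/`except`, as in this sketch. The set `amino_set` and its contents are made up for illustration. Converting the set to a sorted list is one way to get positional access when it is needed.

```python
# Hypothetical example set of amino acid abbreviations
amino_set = {'Gly', 'Ala', 'Ser'}

try:
    amino_set[0]                   # sets have no positions, so this fails
except TypeError as err:
    message = str(err)
    print(message)

# A sorted list built from the set can be indexed
amino_list = sorted(amino_set)
print(amino_list[0])               # Ala
```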

### **Exercise 5: Create a Set called `RNABases_set`**

In the cell below, create a new set called `RNABases_set` and print it out. Remember that in RNA the base `uracil` substitutes for the DNA base `thymine`.

In [None]:
# Insert your code for Exercise 5 here



If your code is correct, you should see the following output but perhaps in a different order:

~~~text
{'U', 'A', 'C', 'G'}
~~~

### Example 6: Algebraic set operations - Union

In Python there are a number of operations and functions that work on different collection types such as sets. In this example, we show one example of an operation called `union`.

The "adding" of one set with another is called the **union of the two sets**. In Python, you can use the `|` operator to create a union of two sets as shown in the next cell.

In [None]:
# Example 6: Union of 2 sets

# Create a new set called AddBases_set
AddBases_set =  {'X', 'Y', 'Z', 'U', 'U', 'A','A'}

# Use | to create union
RNABases_set_union = RNABases_set | AddBases_set

# Print the new set
print(RNABases_set_union)


If the code is correct, you should see the following output but perhaps in a different order:

~~~text
{'Z', 'X', 'C', 'U', 'G', 'Y', 'A'}
~~~

Notice that when we add the two sets together, only the letters `X`, `Y`, and `Z` were added to `RNABases_set`, not the additional `Us` and `As`.

> **Why?**

Because every element in a set must be unique. Since our original `RNABases_set` already contained the letters `U` and `A`, they were not added, only the new letters, `X`, `Y` and `Z`. In other words, a set can only contain **one example of each element**.

**NOTE:** In order for this example to run correctly, you must have successfully completed **Exercise 5** above.

### **Exercise 6: Try to create a set with duplicated items**

Because each element in a set must be unique, when you try to create a set with duplicated items, you don't get an error, but only one item will be added to the set.

In the cell below, create a set called `RNABases_set2` with `{'U', 'A', 'A', 'G', 'U', 'C', 'C'}` and then print out the set.

In [None]:
# Insert your code for Exercise 6 here



If your code is correct, you should see the following output but perhaps in a different order:

~~~text
{'U', 'A', 'G', 'C'}
~~~

The new `RNABases_set2` only contains one example of each item.

### Example 7: Algebraic set operations - Intersection

Another algebraic set operation is **intersection**.

The cell below uses the `&` operator to find the intersection of two sets.

In [None]:
# Example 7: Set intersection using & operator

# Create 2 sets using curly braces
let1_set = {'a','b','c','d','e'}
let2_set = {'c','d','e','f','g'}

# Use `&` to find their intersection
let_set_intersection = let1_set & let2_set

# Print out the intersection
print(let_set_intersection)


If the code is correct, you should see the following output but perhaps in a different order:

~~~text
{'e', 'c', 'd'}
~~~

Set intersection is the set of elements that **both sets have in common**. In this example, only the letters `c`, `d` and `e` were contained in both sets.

### **Exercise 7: Algebraic set operations - Intersection**

In Example 7, set intersection was found using the `&` operator. Python also offers the `intersection()` method for accomplishing the same thing. In the cell below, use the `intersection()` method to find the intersection between the same two sets, `let1_set` and `let2_set` used in Example 7.

(**HINT:** The use of Python **methods** was covered in Class_01_2. Methods are called using **dot notation**. In this case, the `intersection()` method is attached (by the dot) to the first set and its argument is the second set.)

In [None]:
# Insert your code for Exercise 7 here



If your code is correct, you should see the following output but perhaps in a different order:

~~~text
{'e', 'c', 'd'}
~~~


### Example 8: Use `add()` method with sets

A `list` is always enclosed in square brackets `[ ]`, a `tuple` in parentheses `( )`, and similarly a `set` is enclosed in curly braces `{ }`.

Programs can add items to a `set` dynamically as they run by calling the set's **`add()` method**. To add an item to a `list`, however, you must instead use the **`append()` method**.

In other words, items are added to `sets` using `add()`, while items are added to `lists` using `append()`.

In [None]:
# Example 8: Use add() method

# Create a new empty set called mySet
mySet = set()

# Add letter `a` to the empty set
mySet.add('a')

# Keep adding letters to the set
mySet.add('b')
mySet.add('c')

# Try to add a duplicate letter `c`
mySet.add('c')

# Print out the set
print(mySet)


If the code is correct, you should see the following output but perhaps in a different order:

~~~text
{'a', 'b', 'c'}
~~~

Sets can only contain unique items so there is only one `c` in the final set.

### **Exercise 8: Use `append()` method with lists**

While programs can dynamically add items to a `set` with the `add( )` method you must use the `append( )` method to add items to a `list`.

In the cell below use the `append( )` method to add an item to a list called `myList` using Example 8 as a template.

In [None]:
# Insert your code for Exercise 8 here



If your code is correct, you should see the following list:

~~~text
['a', 'b', 'c', 'c']
~~~

Lists (but not sets) can contain duplicate items so the letter `c` appears twice in this list.

## **JSON (JavaScript Object Notation)**

Data stored in a comma separated values (CSV) file must be flat. In a flat file, all the data must fit neatly into rows and columns.

Here is an example of data stored in a flat, CSV file:

![__](https://biologicslab.co/BIO1173/images/module_01/class_01_3_image02.png)

Most people refer to this type of data as structured or **tabular**. This data is tabular because the number of columns is the same for every row. Individual rows may be missing a value for a column but these rows still have the same number of columns. Tabular data is convenient for machine learning because most models, such as neural networks, also expect incoming data to be of fixed dimensions.

On the other hand, real-world information is not always so tabular. This is where **JSON** comes in. Instead of being tabular, **JavaScript Object Notation (JSON)** is a standard file format that stores data in a hierarchical format similar to **eXtensible Markup Language (XML)**.

JSON is nothing more than a hierarchy of lists and dictionaries. Programmers refer to this sort of data as semi-structured data or hierarchical data.

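The following is a sample JSON file. Even though it was written as JSON rather than Python, this particular sample also happens to be a valid Python dictionary literal, so Python is able to run it! (That is not true of all JSON: the JSON values `true`, `false` and `null` are not Python names.)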

In [None]:
# JSON example. RUN THIS CELL
{
    "glossary": {
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef": {
                        "para": "A meta-markup language.",
                        "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    }
}

If the code is correct, you should see the following output:

~~~text
{'glossary': {'title': 'example glossary',
  'GlossDiv': {'title': 'S',
   'GlossList': {'GlossEntry': {'ID': 'SGML',
     'SortAs': 'SGML',
     'GlossTerm': 'Standard Generalized Markup Language',
     'Acronym': 'SGML',
     'Abbrev': 'ISO 8879:1986',
     'GlossDef': {'para': 'A meta-markup language.',
      'GlossSeeAlso': ['GML', 'XML']},
     'GlossSee': 'markup'}}}}}
~~~

A data scientist will generally encounter **JSON** when they access web services to get their data.
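When JSON arrives as text, for example from a web service, Python's standard-library `json` module can convert it to dictionaries and lists and back again. The sketch below uses a made-up JSON string; the field names (`name`, `reads`, `paired`, `notes`) are hypothetical and do not come from any real service.

```python
import json

# Hypothetical JSON text, such as a web service might return
json_text = '{"name": "sample_01", "reads": [12, 7, 30], "paired": true, "notes": null}'

# json.loads() parses JSON text into Python dictionaries and lists
record = json.loads(json_text)
print(type(record))                        # <class 'dict'>
print(record['reads'][0])                  # 12
print(record['paired'], record['notes'])   # True None

# json.dumps() converts Python objects back into JSON text
round_trip = json.dumps(record, indent=2)
print(round_trip)
```

Note that the JSON values `true` and `null` become Python's `True` and `None` after parsing.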

## **Lesson Turn-in**

When you have completed and run all of the code cells, generate a PDF of your Colab notebook. On Microsoft Windows, use **File --> Print.. --> Microsoft Print to PDF**; on a Mac, use **File --> Print.. --> Save to PDF**.

Name your PDF as `Class_01_3.lastname.pdf` where _lastname_ is your last name, and upload the file to Canvas for grading.

-------------------------------------------------

## **Lizard Tail**

The topic for the **Lizard Tail** lesson is the meaning of the "Poly A Tail" from the study of molecular genetics.

## **Poly A Tail in Molecular Genetics**


![__](https://biologicslab.co/BIO1173/images/module_01/class_01_3_image03.png)

#### What is the Poly A Tail?
The **poly A tail** is a long chain of adenine nucleotides (A's) added to the 3' end of a messenger RNA (mRNA) molecule during RNA processing. This process, known as **polyadenylation**, involves the enzyme **poly-A polymerase**. The poly A tail typically consists of 100-250 adenine residues.

#### Function
The primary functions of the poly A tail include:
1. **Stability**: It protects the mRNA from enzymatic degradation in the cytoplasm, increasing its stability.
2. **Nuclear Export**: It aids in the export of the mRNA from the nucleus to the cytoplasm, where it can be translated into a protein.
3. **Translation Efficiency**: It enhances the translation efficiency of the mRNA by ribosomes.
4. **Transcription Termination**: It plays a role in the termination of transcription.

#### Discovery
The discovery of the poly A tail is attributed to several researchers in the 1970s. Notably, **Richard J. Roberts** and **Phillip Sharp** independently discovered introns and the process of RNA splicing, which also contributed to understanding mRNA processing, including polyadenylation.

Richard J. Roberts was working at the **Cold Spring Harbor Laboratory** on Long Island, New York, when he made his discovery. Phillip A. Sharp was at the **Massachusetts Institute of Technology (MIT)** in Cambridge, Massachusetts.

Their independent discoveries of split genes and RNA splicing in 1977 were groundbreaking and earned them the Nobel Prize in Physiology or Medicine in 1993.

#### Importance

The poly A tail is crucial for gene expression and regulation. It ensures that mRNA molecules are stable enough to be translated into proteins, which is essential for the proper functioning of cells. Additionally, the length of the poly A tail can influence the lifespan of the mRNA and, consequently, the level of protein production.

In summary, the poly A tail is a vital component of mRNA that enhances its stability, aids in its export from the nucleus, and improves translation efficiency, playing a key role in gene expression and regulation.

## **The Poly A Tail: Structure and Function in Molecular Genetics**

### **Introduction**
The **Poly A Tail** is a critical component in the molecular biology of eukaryotic gene expression. It refers to a long stretch of adenine nucleotides added to the 3' end of a messenger RNA (mRNA) molecule. This modification is part of a larger set of processing steps—collectively known as RNA processing—that transforms a precursor mRNA (pre-mRNA) into a mature mRNA molecule capable of being translated into protein. The addition of this tail is termed **polyadenylation**. Unlike the coding region of the mRNA, the poly A tail is not encoded directly by the DNA template in a one-to-one fashion; rather, it is synthesized enzymatically after transcription has occurred.

### **Structure and Synthesis**
Structurally, the poly A tail is a homopolymeric chain, meaning it consists of identical repeating units—specifically, adenosine monophosphates. In mammalian cells, this tail is typically composed of **100 to 250 adenine residues**.

The synthesis of the poly A tail occurs in the nucleus immediately after transcription. The process is catalyzed by an enzyme called **poly-A polymerase (PAP)**. Before the tail is added, the pre-mRNA is cleaved at a specific site downstream of a conserved recognition sequence (often AAUAAA). Once cleaved, poly-A polymerase adds the adenine nucleotides one by one to the exposed 3' hydroxyl group of the RNA chain. This structure serves as a binding platform for a specific protein called the **Poly-A Binding Protein (PABP)**, which coats the tail and mediates many of its biological functions.

### **Biological Functions**
The poly A tail is not merely an ornamental appendage; it is essential for the life cycle of the mRNA. Its primary functions can be categorized into four key areas:

1.  **Stability and Protection:** The cytoplasm of a cell contains enzymes known as exonucleases that degrade RNA molecules. The poly A tail acts as a protective buffer at the 3' end. By binding with PABPs, the tail effectively shields the coding sequence of the mRNA from enzymatic degradation, thereby extending the molecule's half-life and ensuring it survives long enough to be translated.

2.  **Nuclear Export:** Transcription occurs in the nucleus, but protein synthesis happens in the cytoplasm. The poly A tail, in conjunction with its binding proteins, serves as a "ticket" for export. It helps the nuclear export machinery recognize mature mRNA molecules and guide them through the nuclear pore complexes into the cytoplasm.

3.  **Translation Efficiency:** The poly A tail actively promotes translation. In the cytoplasm, the PABP bound to the tail interacts with initiation factors at the 5' end of the mRNA (specifically the 5' cap). This interaction effectively circularizes the mRNA molecule, creating a closed-loop structure that recruits ribosomes more efficiently and allows for the recycling of ribosomal subunits, thereby boosting the rate of protein synthesis.

4.  **Transcription Termination:** The process of adding the poly A tail is mechanically linked to the termination of transcription. The cleavage of the nascent RNA transcript, which precedes polyadenylation, signals the RNA polymerase to disengage from the DNA template, effectively ending the transcription process.

### **Significance and Discovery**
The discovery of mechanisms like polyadenylation and RNA splicing in the 1970s revolutionized our understanding of gene regulation. Researchers **Richard J. Roberts** (Cold Spring Harbor Laboratory) and **Phillip Sharp** (MIT) were pivotal in this era, independently discovering split genes and RNA splicing, work for which they shared the **Nobel Prize in Physiology or Medicine in 1993**.


# Poly-A Tail in Molecular Genetics

## Introduction

The poly-A tail, short for polyadenylate tail, is a crucial structural feature found at the 3' end of most eukaryotic messenger RNA (mRNA) molecules. This tail consists of a long stretch of adenine nucleotides, typically 200-250 bases in length, that plays essential roles in mRNA stability, translation, and regulation of gene expression.

## Structure and Composition

The poly-A tail is a homopolymeric sequence composed exclusively of adenosine monophosphate (AMP) residues. Unlike the coding regions of mRNA that are transcribed directly from DNA templates, the poly-A tail is added post-transcriptionally through an enzymatic process. The tail is not encoded in the genomic DNA but is instead synthesized by poly-A polymerase (PAP) after transcription.

The length of the poly-A tail can vary depending on the organism, cell type, developmental stage, and specific mRNA molecule. In mammals, newly synthesized mRNAs typically have poly-A tails of approximately 200-250 nucleotides, though some specific transcripts may have shorter or longer tails.

## Polyadenylation Process

The addition of the poly-A tail occurs in the nucleus as part of pre-mRNA processing, specifically during a process called polyadenylation or 3' end processing. This process involves several key steps:

1. **Recognition of polyadenylation signals:** The polyadenylation machinery recognizes specific sequence elements in the pre-mRNA, including the highly conserved hexanucleotide AAUAAA sequence (called the polyadenylation signal) located 10-30 nucleotides upstream of the cleavage site, and a downstream element (DSE) rich in U or GU sequences.

2. **Cleavage:** The pre-mRNA is cleaved 10-30 nucleotides downstream of the AAUAAA sequence by a multi-protein complex called the cleavage and polyadenylation specificity factor (CPSF) along with cleavage stimulation factor (CstF).

3. **Poly-A tail synthesis:** Following cleavage, poly-A polymerase adds approximately 200-250 adenine nucleotides to the 3' end of the upstream cleavage product, creating the mature poly-A tail.

The polyadenylation machinery is a large complex containing multiple proteins, including CPSF, CstF, cleavage factors (CF I and CF II), and poly-A polymerase. These proteins work coordinately to ensure accurate and efficient polyadenylation.
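
The three steps above can be mimicked with ordinary Python string operations. The sketch below is a toy model only: the pre-mRNA sequence is invented, the function name `polyadenylate` is hypothetical, and the `offset` and `tail_length` defaults are simply values picked from the ranges quoted above. It ignores the downstream element and all of the protein factors.

```python
PAS = "AAUAAA"  # polyadenylation signal hexamer described in the text

def polyadenylate(pre_mrna, offset=20, tail_length=200):
    """Toy model: cleave downstream of the first AAUAAA and add a poly-A tail.

    `offset` is the assumed distance (in nucleotides) between the end of the
    signal and the cleavage site. Returns None if no signal is found.
    """
    # Step 1: recognize the polyadenylation signal
    signal_start = pre_mrna.find(PAS)
    if signal_start == -1:
        return None

    # Step 2: cleave downstream of the signal, keeping the upstream product
    cleavage_site = signal_start + len(PAS) + offset
    upstream_product = pre_mrna[:cleavage_site]

    # Step 3: add the tail to the 3' end of the upstream product
    return upstream_product + "A" * tail_length

# Invented example sequence (not a real gene)
pre_mrna = "GGCUAGC" + "AAUAAA" + "CUGACUGACUGACUGACUGACA" + "GUGUUGUGUU"
mature = polyadenylate(pre_mrna)
print("pre-mRNA length:", len(pre_mrna))
print("mature mRNA length:", len(mature))
```

Slicing (`pre_mrna[:cleavage_site]`) plays the role of cleavage here, and string repetition (`"A" * tail_length`) stands in for poly-A polymerase.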

## Functions of the Poly-A Tail

The poly-A tail serves multiple critical functions in gene expression:

### 1. **mRNA Stability**
The poly-A tail protects mRNA from degradation by exonucleases, enzymes that degrade RNA from the ends. The tail acts as a buffer, allowing the mRNA to persist in the cell long enough to be translated into protein. Deadenylation (shortening of the poly-A tail) is often the first step in mRNA decay, and the length of the tail correlates with mRNA half-life.

### 2. **Translation Enhancement**
The poly-A tail enhances translation efficiency through interactions with poly-A binding proteins (PABPs). PABPs bind to the poly-A tail and interact with translation initiation factors bound to the 5' cap structure, forming a closed-loop configuration. This circularization of mRNA facilitates ribosome recycling and enhances translation initiation.

### 3. **mRNA Localization**
The poly-A tail and associated binding proteins can influence the subcellular localization of mRNAs, directing them to specific cellular compartments where their protein products are needed.

### 4. **Regulation of Gene Expression**
Changes in poly-A tail length provide a mechanism for post-transcriptional gene regulation. Cytoplasmic polyadenylation can activate dormant mRNAs, while deadenylation can silence gene expression. This regulation is particularly important during development, cell cycle progression, and synaptic plasticity.

## Poly-A Binding Proteins (PABPs)

Poly-A binding proteins are essential mediators of poly-A tail function. The nuclear poly-A binding protein (PABPN1) binds to newly synthesized poly-A tails in the nucleus and regulates tail length. In the cytoplasm, cytoplasmic poly-A binding protein (PABPC) binds to the poly-A tail and performs multiple functions, including protecting the tail from degradation, enhancing translation, and mediating mRNA circularization.

Roughly one PABP molecule binds per 25-30 adenine residues, so several PABPs together coat the entire length of the poly-A tail. The cooperative binding of multiple PABP molecules creates a stable ribonucleoprotein complex that shields the mRNA from degradative enzymes.
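
The binding ratio above lends itself to a quick back-of-the-envelope calculation. The sketch below assumes a footprint of 27 residues per PABP (a midpoint of the 25-30 range, chosen for illustration); the function name and the example tail lengths are hypothetical.

```python
def estimate_pabp_count(tail_length, footprint=27):
    """Estimate how many PABP molecules fit on a tail of `tail_length` residues.

    `footprint` is an assumed number of adenine residues covered per PABP.
    """
    return tail_length // footprint  # integer division: whole proteins only

# Example tail lengths stored in a dictionary (values are illustrative)
tail_lengths = {"newly made": 250, "partly deadenylated": 100, "near decay": 15}

pabp_counts = {label: estimate_pabp_count(n) for label, n in tail_lengths.items()}
for label, count in pabp_counts.items():
    print(f"{label}: about {count} PABP molecule(s)")
```

Under these assumptions, a very short tail cannot hold even one PABP, which is consistent with short-tailed mRNAs losing their protective coat.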

## Dynamic Regulation of Poly-A Tail Length

The poly-A tail is not static but undergoes dynamic regulation throughout the mRNA lifecycle:

### **Deadenylation**
Deadenylation is the progressive shortening of the poly-A tail by deadenylase enzymes. This process is typically the rate-limiting step in mRNA decay. Two major deadenylase complexes function in eukaryotic cells: the CCR4-NOT complex and the PAN2-PAN3 complex. When the poly-A tail is shortened below a critical threshold (typically 10-20 nucleotides), the mRNA becomes susceptible to rapid degradation through decapping and 5'-3' decay or 3'-5' decay by the exosome.
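
Progressive shortening down to a threshold can be sketched with a Python list and a `while` loop. This is a deliberately simplified model: real deadenylases do not remove a fixed number of residues per step, and the `step` and `threshold` values below are assumptions chosen for illustration.

```python
def deadenylate(tail_length, step=10, threshold=20):
    """Shorten a tail in fixed steps until it is at or below `threshold`.

    Returns the list of tail lengths observed along the way.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    lengths = [tail_length]
    while lengths[-1] > threshold:
        # Never let the tail length drop below zero
        lengths.append(max(lengths[-1] - step, 0))
    return lengths

history = deadenylate(60)
print(history)
print("Rounds of shortening before decay can begin:", len(history) - 1)
```

The last element of `history` is the tail length at which, in this toy model, the mRNA would become a target for decapping or exosome-mediated decay.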

### **Cytoplasmic Polyadenylation**
Some mRNAs undergo poly-A tail extension in the cytoplasm, a process called cytoplasmic polyadenylation. This mechanism is particularly important in oocytes, early embryos, and neurons, where it allows for translational activation of dormant mRNAs in response to developmental or environmental signals. Cytoplasmic polyadenylation is mediated by cytoplasmic polyadenylation element binding proteins (CPEBs) that recognize specific sequence elements in the 3' UTR.

## Alternative Polyadenylation

Many genes contain multiple polyadenylation sites, allowing for alternative polyadenylation (APA). The choice of polyadenylation site can result in mRNA isoforms with different 3' untranslated regions (3' UTRs), which can affect mRNA stability, localization, and translation. APA is widespread, affecting more than half of human genes, and provides an important mechanism for expanding proteomic and regulatory diversity.

APA can be regulated in a tissue-specific, developmental-stage-specific, or condition-specific manner, contributing to the complexity of gene expression programs. Changes in APA patterns have been implicated in various diseases, including cancer, neurological disorders, and immune dysfunction.
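
To see how multiple polyadenylation sites produce isoforms with different 3' UTR lengths, the sketch below scans an invented 3' UTR sequence for every AAUAAA signal and stores one dictionary per candidate isoform in a list. The sequence and the helper name `find_pas_sites` are hypothetical, and for simplicity the UTR is assumed to end right after the signal rather than at a cleavage site further downstream.

```python
PAS = "AAUAAA"  # polyadenylation signal hexamer

def find_pas_sites(utr):
    """Return the start index of every AAUAAA signal in `utr`."""
    sites = []
    start = utr.find(PAS)
    while start != -1:
        sites.append(start)
        start = utr.find(PAS, start + 1)  # keep searching past this hit
    return sites

# Invented 3' UTR with a proximal and a distal signal
utr = "CCGG" + "AAUAAA" + "GCGCGCGCGCUUUU" + "AAUAAA" + "GGCC"

sites = find_pas_sites(utr)
isoforms = [{"pas_index": s, "utr_length": s + len(PAS)} for s in sites]
for isoform in isoforms:
    print(isoform)
```

The proximal site gives the shorter 3' UTR, which in a real transcript could remove regulatory elements located between the two sites.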

## Clinical and Research Significance

Understanding poly-A tail biology has important implications for biotechnology and medicine:

- **mRNA therapeutics:** The COVID-19 mRNA vaccines utilize optimized poly-A tails to enhance mRNA stability and translation efficiency, demonstrating the practical importance of poly-A tail biology.

- **Disease mechanisms:** Aberrant polyadenylation has been implicated in various diseases, including cancer, where altered APA patterns can affect oncogene and tumor suppressor expression.

- **Diagnostic markers:** Poly-A tail length and polyadenylation site usage can serve as biomarkers for disease states or treatment responses.

- **Gene expression analysis:** Techniques like poly-A selection are routinely used in RNA sequencing to enrich for mRNA and reduce ribosomal RNA contamination.
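
The idea behind poly-A selection can be imitated in code with a dictionary of sequences and a set comprehension. In the laboratory, selection is done by physically capturing polyadenylated RNA (for example with oligo(dT)), not by software; the transcript names, sequences, and the `min_tail` cutoff below are all invented for illustration.

```python
def has_poly_a_tail(seq, min_tail=10):
    """Return True if `seq` ends with at least `min_tail` consecutive A's."""
    return seq.endswith("A" * min_tail)

# Invented transcripts: two with tails, one without
transcripts = {
    "mRNA_1": "GGCUAG" + "A" * 30,
    "rRNA_1": "GGCUAGCUAGCC",
    "mRNA_2": "CCGAU" + "A" * 12,
}

# A set holds the names of transcripts that pass the filter
selected = {name for name, seq in transcripts.items() if has_poly_a_tail(seq)}
print(sorted(selected))
```

Because the untailed transcript fails the check, it is left out of `selected`, which mirrors how ribosomal RNA is depleted during poly-A enrichment.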

## Conclusion

The poly-A tail represents a fundamental feature of eukaryotic mRNA biology, playing indispensable roles in mRNA metabolism, translation, and gene regulation. From its enzymatic synthesis in the nucleus to its dynamic regulation in the cytoplasm, the poly-A tail exemplifies the sophisticated mechanisms cells employ to control gene expression post-transcriptionally. As research continues to uncover new aspects of poly-A tail biology, including its regulation and role in disease, this simple homopolymeric structure continues to reveal its complexity and importance in molecular genetics.


In summary, the poly A tail is a vital regulator of gene expression. By governing mRNA stability, export, and translation efficiency, it ensures that the genetic code is not only preserved but also effectively converted into the functional proteins required for cellular life.

--------------------------------------

