<h1 style="font-size: 40px; margin-bottom: 0px;">1.1 Intro to Python</h1>

<hr style="margin-left: 0px; border: 0.25px solid; border-color: #000000; width: 600px;"></hr>

This course assumes that you have no prior experience with Python or data analysis. Python has many applications, and in this course, we'll be learning how to use it in the context of bioinformatics and data analysis in biology research. If you have any questions during lab sessions, don't hesitate to ask. For those of you with previous experience, feel free to help out your classmates.

This course makes use of Jupyter Notebook, an open-source, web-based computing platform that allows you to write and run code for data analysis. Jupyter supports multiple languages, including Python and R. You can create and share documents that contain live code, equations, visualizations, and text. UC Berkeley's JupyterHub service provides the computational infrastructure to lower the activation energy for getting started, because it already comes with resources for data analysis and bioinformatics.

This notebook provides an environment for interactive computing in Python. You can write narrative text (like this), write and run code, and you can visualize your data and the outputs of your analyses.

If at any time you need to re-download the original version of a notebook for this class, you can simply delete or rename your old one and pull the GitHub repo for this class again.

In today's class, we'll be going over the basics of Python, its syntax, and a few useful packages for data analysis. 


<h1 style="font-size: 40px; margin-bottom: 0px;">Notebook Structure</h1>
<hr style="margin-left: 0px; border: 0.25px solid; border-color: #000000; width: 600px;"></hr>

In this notebook, you can see that this section is separated from the previous sections. Each section is a **cell**, and each cell can be modified independently of the others. Cells can be designated as **Code**, **Markdown**, or **Raw**.

<h2 style="font-size: 32px;">Code</h2>

By designating a cell as **Code**, you can execute and run programming code. In this case, the programming language is Python. 

*Click on this cell to select it.* You should see a colored vertical bar to the left of this cell and an outline around it, indicating that this cell is selected. You should also see six icons at the top right-hand corner of the cell. Click on the **Insert a cell below** icon or press **B**. You should see an empty code cell appear below this one. By default, cells are designated as code.

*Type into your new cell, 1+1 and then press **Shift+Enter**.* This will run the code just in your selected cell and advance to the next cell, and the output will appear below it.

<hr style="margin-left: 0px; border: 0.25px solid; border-color: #EEEEEE; width: 600px;"></hr>

What you can see is that the code is interpreted as a mathematical expression, and when you run the cell, the notebook acts as a calculator.

<h2  style="font-size: 32px;">Markdown</h2>

You can also designate a cell as **Markdown**, which allows you to create text, add images, tables, links, and code that won't be executed.

For example, *create a new cell below this one*, and then at the top bar of this notebook, you should see a dropdown menu showing Code.

*Open the dropdown menu and select Markdown. Type into the cell 1+1, then press **Shift+Enter**.*

<hr style="margin-left: 0px; border: 0.25px solid; border-color: #EEEEEE; width: 600px;"></hr>

You'll notice that this time, the notebook is not generating any output. This is because the cell is designated as Markdown, and so it will not run code.

It will, however, format text, images, tables, etc. based on HTML/CSS and Markdown syntax. You can see examples of this by double-clicking any of the Markdown cells to see how they are written in plain text. You can hit **Shift+Enter** to return to the formatted text.

<h2  style="font-size: 32px;">Raw</h2>

The final designation is **Raw**. This is for more specific use cases. Raw cells are used when you need to render in a different code format, such as HTML or LaTeX. 

<h2 style="font-size: 32px;">Hiding and revealing cells</h2>

You can also collapse cells if you want to reduce visual clutter, reduce vertical height, or focus on a specific set of cells.

*Select the cell above this one*. You should see an arrowhead on the left. If you click on that arrowhead, you'll hide the cells below it.

Since the header is denoted with an HTML header2 tag (see below, or double click the header)
```
<h2>Hiding and revealing cells</h2>
```
The three cells below it are considered its children.

And those cells can be hidden, leaving only the header **Hiding and revealing cells**

<h2 style="font-size: 32px;">Moving cells</h2>

You can move cells around as well if you need to rearrange things or to have a specific cell run before another one.

*Select this cell.* You should see at the top righthand side, an up and a down arrow. Click the arrow to move this cell either above the preceding cell or below the following cell.

<h2 style="font-size: 32px;">Duplicating and deleting cells</h2>

You can also duplicate and delete cells with their respective icons at the top righthand side of a selected cell.

*Select this cell.* Click the **Create a duplicate of this cell below** icon to duplicate this cell.

*Select the duplicated cell.* Click the **Delete this cell** icon or double-press **D** to delete the copied cell.

<h1 style="font-size: 40px; margin-bottom: 0px;">Syntax</h1>
<hr style="margin-left: 0px; border: 0.25px solid; border-color: #000000; width: 600px;"></hr>


As with any language, a programming language follows a specific syntax so that code can be interpreted and executed. You'll want to understand how Python interprets what you write so that you can get it to do what you want.

<h2 style="font-size: 32px;">Comments in code cells</h2>

Often, you will annotate your code with comments. To mark a line of code out as a comment, you can add <mark style="background-color: #EEEEEE"><strong>#</strong></mark> at the start of the line. The notebook will then interpret the entire line as a comment, and it won't be run as code.

To create a new line in a code cell, simply press **Enter**.

Comments are important because when you return to your code later on, you can recall what you were doing or why you were doing it, or when someone is reviewing your code, they are able to understand your thought process and how your code is set up. 

In [None]:
#Run this code cell to see what the output is.
3+3
#You should see that lines marked as a comment are not run as code or included in the output.

Comments also come in handy if you need to keep line(s) of code from running without deleting them: simply mark them out as comments. This keeps the lines in place without executing them when you run the cell.

*Return to the previous code cell, and add a <mark style="background-color: #EEEEEE"><strong>#</strong></mark> to the front of the math expression, then hit **Shift+Enter** to run the cell.* You should see that the line that you marked out as a comment is no longer executed.

If you place a <mark style="background-color: #EEEEEE"><strong>#</strong></mark> in the middle of a line, everything in the rest of the line will be ignored.

In [None]:
#For example, try running this code cell to see how the output looks.
3+3 #+ 2+2
#You should see that the portions after a # are excluded from running.
#If you remove the #, and re-run the cell, you should see that the rest of the expression is now executed.

<h3>Multiline comments</h3>

As you probably noticed, if you want to have comments spanning multiple lines, you need to mark out each line as a comment individually, since Python does not have syntax to denote a multiline comment.

<h2 style="font-size: 32px;">Numbers and basic math</h2>

As you have seen earlier when playing with code cells, the notebook is able to act as a calculator because it can interpret the numbers and mathematical expressions you input.

Below, let's run through some quick math calculations to get a feel for running code cells. In the empty code cells, practice running some basic calculations such as the ones shown. Feel free to play with different numbers.

Addition is denoted by <mark style="background-color: #EEEEEE;"><strong>+</strong></mark>
```
3+3
```

In [None]:
#Play with some addition.

Subtraction is denoted by <mark style="background-color: #EEEEEE"><strong>-</strong></mark>
```
6-3
```

In [None]:
#Try out subtraction.

Multiplication is denoted by <mark style="background-color: #EEEEEE"><strong>*</strong></mark>
```
3*2
```

In [None]:
#Have the notebook do your multiplication.

Division is denoted by <mark style="background-color: #EEEEEE"><strong>/</strong></mark>
```
6/2
```

In [None]:
#And division too.

You can also calculate powers by using the <mark style="background-color: #EEEEEE"><strong>**</strong></mark> operator
```
3**2
```

In [None]:
#And even calculate exponents.

<h3 style="font-size: 24px;">Integer vs Float</h3>

Python has multiple ways of classifying a number, and it's important to keep this in mind because some functions require a number to be of a specific data type.

An **int** is a numeric data type that represents **integers** (positive or negative whole numbers), and as such, they do not have decimals in them. This data type is often used in Python for counting and indexing. 

On the other hand, a **float** represents a **floating point value** (numbers with decimals). This data type comes into play when dealing with numbers that are not integers and for performing mathematical calculations, since values can be represented with decimals.

In the case of division, the calculated value that Python returns is a **float**, and you should be able to see in the calculation(s) you performed above that the values it outputs are represented with decimals.
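
As a quick check of the point above, the sketch below divides two whole numbers and inspects the result with the built-in **type()** function (covered later in this notebook). The variable names are illustrative, and the **//** (floor division) operator is not discussed in this notebook; it is included only as a contrast.

```python
# Regular division returns a float, even when the result is a whole number.
quotient = 6 / 2
print(quotient)        # 3.0
print(type(quotient))  # <class 'float'>

# Floor division (//) between two ints returns an int instead.
whole_quotient = 6 // 2
print(whole_quotient)        # 3
print(type(whole_quotient))  # <class 'int'>
```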

<h2 style="font-size: 32px;">Text</h2>

Text in Python is represented by the data type **str**, also called **strings**. Characters enclosed within either single quotations (<mark style="background-color: #EEEEEE"><strong>'...'</strong></mark>) or double quotations (<mark style="background-color: #EEEEEE"><strong>"..."</strong></mark>) are interpreted as **str**, and this includes: 
<ul>
    <li>Single characters - <mark style="background-color: #EEEEEE"><strong>'A'</strong></mark> or <mark style="background-color: #EEEEEE"><strong>"A"</strong></mark></li>
    <li>Special characters - <mark style="background-color: #EEEEEE"><strong>'?'</strong></mark> or <mark style="background-color: #EEEEEE"><strong>"?"</strong></mark></li>
    <li>Numbers - <mark style="background-color: #EEEEEE"><strong>'1'</strong></mark> or <mark style="background-color: #EEEEEE"><strong>"1"</strong></mark></li>
    <li>Sentences or phrases - <mark style="background-color: #EEEEEE"><strong>'This is a sentence.'</strong></mark> or <mark style="background-color: #EEEEEE"><strong>"This is a sentence."</strong></mark></li>
    <li>Or a combination of the above - <mark style="background-color: #EEEEEE"><strong>'Is this 1 sentence?'</strong></mark> or <mark style="background-color: #EEEEEE"><strong>"Is this 1 sentence?"</strong></mark></li>
</ul>



In [None]:
#Try defining a string in this code cell by using either single quotes or double quotes.

Strings can also be manipulated with some of the same operators that we use for math expressions. For example, if you want to stick multiple strings together (concatenate), you can make use of the <mark style="background-color: #EEEEEE"><strong>+</strong></mark> operator.
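
A minimal sketch of concatenation, using made-up strings and variable names:

```python
# Concatenate two strings with the + operator.
greeting = 'What a ' + 'delicious potato.'
print(greeting)  # What a delicious potato.

# Nothing is added between the pieces, so include any spaces yourself.
organism = 'Solanum' + ' ' + 'tuberosum'
print(organism)  # Solanum tuberosum
```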

In [None]:
#The + operator will stick a string to the preceding one.

Strings can be repeated by using the <mark style="background-color: #EEEEEE"><strong>*</strong></mark> operator.

In [None]:
#The * operator will repeat the string and concatenate the repeats,
#and the number of repeats is defined by a provided int data type.

If the value you provide is not an **int**, but rather a **float**, then Python will give you an error.
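
The sketch below shows both cases with an illustrative string. The **try**/**except** statement is not covered in this notebook; it is used here only so that the cell keeps running after the error and prints the message instead.

```python
# Repeat a string by multiplying it by an int.
repeated = 'AT' * 3
print(repeated)  # ATATAT

# Multiplying by a float raises a TypeError, even if the float is a whole number.
try:
    'AT' * 3.0
except TypeError as error:
    error_message = str(error)
    print('TypeError:', error_message)
```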

In [None]:
#Run this cell with a float.

This is a simple case where the difference between **int** and **float** comes into play.

<h2 style="font-size: 32px;">Indentation</h2>

Python is sensitive to indentation, and while in other languages, indentation is used to help with readability, indentation in Python indicates a specific block of code. You'll see this become important as you run more complex code and functions, but for now, you'll just want to keep this in mind.
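
As a small preview, the sketch below uses an **if** statement (introduced later, with Boolean values) to show how indentation marks which lines belong to a block. The variable names and values are made up for illustration.

```python
temperature = 42

# The indented lines form the block that belongs to the if statement.
if temperature > 37:
    status = 'above body temperature'
    print(status)

# This line is not indented, so it runs regardless of the condition.
print('Done checking.')
```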

<h2 style="font-size: 32px;">Variables</h2>

A **variable** in Python can be thought of as a label for an object and, as such, acts as a pointer.

**Objects** in Python can be basically anything, such as a **float** or **int** or later on compound data types like **list**, and every object will have an associated data value, data type, and identity.  The identity (<mark style="background-color: #EEEEEE;"><strong>id</strong></mark>) property of an object is assigned when that object is generated. What this means under the hood is that Python allocates memory for it and the <mark style="background-color: #EEEEEE;"><strong>id</strong></mark> can be thought of as an "address" for Python to find or access the object. 

You can assign an object to a variable, which can be accessed later on in the code. If you don't assign an object to a variable, Python will leave that "address" available for other uses, so it can be freed up if needed. However, by assigning an object to a variable, you essentially tell Python that you'll need that object for something later on, and so Python will know to hold onto the memory allocated for that object.

This allows you to perform calculations and assign the results to a variable, which can then be used in subsequent calculations or visualizations.

To assign a value to a variable, you will use the <mark style="background-color: #EEEEEE"><strong>=</strong></mark> operator. 
```
A = 3
```
This tells Python that the variable **A** will be assigned a value **3**.

In [None]:
#In a code cell, the above will look like:

#And if you run this code cell, you'll notice that there's no output.
#That's because Python is simply assigning the value to the variable.

The value that you assign can be either an **int**, as shown above, or a **float** or **str**, and later on, you'll see that you can also assign **arrays**, **lists**, or **tuples** to variables as well. 
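
For instance, the sketch below assigns one value of each type to a variable. The names and values are hypothetical and chosen only for illustration.

```python
# Assign an int, a float, and a str to three variables.
colony_count = 12      # int
growth_rate = 0.35     # float
species = 'E. coli'    # str

# Assignment alone produces no output; print the variables to see their values.
print(colony_count, growth_rate, species)  # 12 0.35 E. coli
```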

In [None]:
#Here, we'll assign a float to variable B.

#And then, assign a str to variable C.

#Make sure you run this code cell; otherwise, Python will not assign the values to variables B and C.

Let's check to see if our variables are properly assigned.

In [None]:
#You can simply type the variable into its own line, and run the cell.

In [None]:
#Let's check the variable B.

In [None]:
#And the variable C, as well.

With these variables assigned, we can also use them in math expressions and manipulate the values using the mathematical operators we're familiar with.

In [None]:
#Add A and B together.

In [None]:
#Repeat C by the number of times indicated by variable A.

We can also perform operations and assign the results to a specific variable.

For example:
```
D = 3**3
```
*Give it a try in the empty code cell below.*

You should notice that no output is spit out even though you are performing a calculation. This is because you're assigning the results of that calculation to a variable.

You can combine variables, values, and operations together into one.

For example:
```
E = D + 5*(A+B)
```

*Give it a try in the empty code cell below.*

In [None]:
#Play around with assigning the results of calculations to different variables in this code cell.
#Or you can create additional code cells to play around in as well.

<h3 style="font-size: 24px;">Naming conventions for Python variables</h3>

There are a few rules that restrict how you can name variables in Python.

Only a specific set of characters are allowed in names for variables:
<ul>
    <li>Alphanumeric characters - <mark style="background-color: #EEEEEE"><strong>A</strong></mark> or <mark style="background-color: #EEEEEE"><strong>Variable</strong></mark> or <mark style="background-color: #EEEEEE"><strong>A1</strong></mark></li>
    <li>Underscores - <mark style="background-color: #EEEEEE"><strong>Variable_A1</strong></mark> or <mark style="background-color: #EEEEEE"><strong>A_variable_1</strong></mark> or <mark style="background-color: #EEEEEE"><strong>_variable_A1</strong></mark></li>
</ul>

Names of variables must begin with either a **letter** or an **underscore**. They **cannot** begin with a number, so <mark style="background-color: #EEEEEE"><strong>1potato</strong></mark> is not a valid variable name.

Names are case-sensitive, so <mark style="background-color: #EEEEEE"><strong>A</strong></mark> is different than <mark style="background-color: #EEEEEE"><strong>a</strong></mark>, and <mark style="background-color: #EEEEEE"><strong>DeliciousPotato</strong></mark> is different than <mark style="background-color: #EEEEEE"><strong>deliciouspotato</strong></mark> ~but both are equally delicious~.

Names for variables cannot have a hyphen, so <mark style="background-color: #EEEEEE"><strong>delicious-potato</strong></mark> is an invalid name. You can instead use an underscore: <mark style="background-color: #EEEEEE"><strong>delicious_potato</strong></mark>.
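
One way to check these rules without triggering a syntax error is the string method **isidentifier()**, which is not covered in this notebook. It reports whether a piece of text follows the naming rules, though it does not check for reserved keywords. A short sketch using the example names from above:

```python
# isidentifier() returns True if the text is a valid Python name.
valid_mixed = 'Variable_A1'.isidentifier()
valid_underscore = '_variable_A1'.isidentifier()
starts_with_number = '1potato'.isidentifier()
has_hyphen = 'delicious-potato'.isidentifier()

print(valid_mixed)         # True
print(valid_underscore)    # True
print(starts_with_number)  # False
print(has_hyphen)          # False

# Names are case-sensitive, so these are two separate variables.
DeliciousPotato = 1
deliciouspotato = 2
print(DeliciousPotato == deliciouspotato)  # False
```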

<h2 style="font-size: 32px;">Functions, parameters, and arguments</h2>

A **function** is a block of code, consisting of a series of statements, that performs its defined specific task only when called. Generally, a function will return some kind of output value. The actions that a function will take are usually either pre-built into Python, such as the <mark style="background-color: #EEEEEE"><strong>print()</strong></mark> function, or defined by modules you import (more on this later), or defined by you. These actions can range from simple math calculations, returning an output of a data type, or more complex tasks. 

For example, let's try out the <mark style="background-color: #EEEEEE"><strong>print()</strong></mark> function in the empty code cell below.

In [None]:
#Run this cell to print the sentence: What a delicious potato.
print('What a delicious potato.')
#Note that by enclosing the sentence in single quotes ' ' Python interprets the sentence as a string.

In [None]:
#You can also print variables and the results of calculations.

#or you can first assign the result to a variable, then print the result.


In [None]:
#And you can print multiple values together using the print function.

<h3>Parameter vs argument</h3>

A function generally requires some kind of input or information that is passed to it in order to perform its action. **Parameters** are the kinds of information or inputs that the function requires in order to perform its task. These are named when defining the specific function. On the other hand, an **argument** is the specific value that is passed to the function. Oftentimes, these two terms are used interchangeably, but essentially, the argument is the value you set a specific parameter to.  

So if we look up <u><a href="https://docs.python.org/3/library/functions.html#print" rel="noopener noreferrer" target="_blank">Python's documentation for the <strong>print()</strong> function</a></u> to peek under the hood:
```
print(*objects, sep=' ', end='\n', file=None, flush=False)
```
You can see that there's a little bit more going on than what you can see with a simple <mark style="background-color: #EEEEEE"><strong>print('Hello')</strong></mark>. Under the hood, the function has the following 5 parameters:
<ul>
    <li><mark style="background-color: #EEEEEE"><strong>*objects</strong></mark></br>
    Object(s) that you want to print. They will be converted to string beforehand.</br>
    The asterisk denotes that there's no set number of arguments that need to be passed to your function.</li>
    </br>
    <li><mark style="background-color: #EEEEEE"><strong>sep</strong></mark></br>
    Must be string type</br> 
    An optional parameter that defines the separation between multiple objects that you want to print. Set to ' ' (a single space) by default.</li>
    </br>
    <li><mark style="background-color: #EEEEEE"><strong>end</strong></mark></br>
    Must be string type</br>
    An optional parameter that defines how the print ends. Set to <mark style="background-color: #EEEEEE;"><strong>'\n'</strong></mark> (line break) by default.</li>
    </br>
    <li><mark style="background-color: #EEEEEE"><strong>file</strong></mark></br>
    An optional parameter that sets the object that the output is written to; this object must have a write method. Set to None by default, in which case <mark style="background-color: #EEEEEE"><strong>sys.stdout</strong></mark>, the standard output stream, is used.</li>
    </br>
    <li><mark style="background-color: #EEEEEE"><strong>flush</strong></mark></br>
    An optional parameter that forcibly flushes the stream if set to True. Set to False by default.</li>
</ul>
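
The sketch below illustrates the **sep** and **end** parameters described above. The strings are arbitrary examples.

```python
# Multiple objects are separated by a single space by default.
print('A', 'T', 'G')            # A T G

# sep changes what is placed between the objects.
print('A', 'T', 'G', sep='-')   # A-T-G

# end changes what is printed after the last object, so the next
# print() call continues on the same line.
print('ATG', end=' | ')
print('TAA')                    # ATG | TAA
```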

In [None]:
#To see how this works in practice:

#Play around with the sep and the end parameter to see how the output of the print() function changes.

<h3>Built-in functions</h3>

Python has several <u><a href="https://docs.python.org/3/library/functions.html" rel="noopener noreferrer" target="_blank">built-in functions</a></u> that you can call without having to import modules beforehand.

For example, there is a <mark style="background-color: #EEEEEE"><strong>type()</strong></mark> function, that will output the data type of a value.
Recall that data values can be stored as **int** or **float** or **str**, and it's important to know what type you are working with in case a parameter in a function requires a specific data type.

<em>Test out in the code cells below how the resulting data types differ for different mathematical operations by combining both the <mark style="background-color: #EEEEEE;"><strong>print()</strong></mark> and the <mark style="background-color: #EEEEEE;"><strong>type()</strong></mark> functions. Recall that division returns a <strong>float</strong> rather than an <strong>int</strong>.</em>
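
As one possible starting point, the sketch below stores the results of a few operations and prints their types. The variable names are illustrative.

```python
sum_result = 3 + 3
quotient_result = 6 / 3
text_result = 'potato' * 2

print(type(sum_result))       # <class 'int'>
print(type(quotient_result))  # <class 'float'>
print(type(text_result))      # <class 'str'>
```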

In [None]:
#Example:

Recall that each Python object has its own memory allocated to it (the object's <mark style="background-color: #EEEEEE;"><strong>id</strong></mark> property). We can tell Python to return an object's identity (in the standard CPython interpreter, its memory address) using the built-in <mark style="background-color: #EEEEEE;"><strong>id()</strong></mark> function.
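
A minimal sketch, with hypothetical variable names, showing that two variables assigned to the same object report the same id. The actual number will differ from run to run, so no specific value is shown.

```python
sample = 'potato'
same_sample = sample  # a second label for the same object

print(id(sample))                     # an integer; the value varies between runs
print(id(sample) == id(same_sample))  # True
```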

In [None]:
#Get Python to return each id for the variables you created above.

You can also convert values to an **int**, **float**, or **str** using built-in functions.

<mark style="background-color: #EEEEEE;"><strong>int()</strong></mark> will convert a value from either a **float** or a **string** to an **int**. If your **float** has a fractional part, it is truncated toward zero: positive values round down and negative values round up. If your **string** does not represent a whole number, then it will give you an error.

The function <mark style="background-color: #EEEEEE;"><strong>float()</strong></mark> will convert **int** or **string** to a **float** value. Like with <mark style="background-color: #EEEEEE;"><strong>int()</strong></mark>, <mark style="background-color: #EEEEEE;"><strong>float()</strong></mark> will output an error if your **string** cannot be represented as a **float**.
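
The conversions described above can be sketched as follows. The values are arbitrary, and the **try**/**except** statement (not covered in this notebook) is used only so the cell keeps running after the error.

```python
truncated_positive = int(3.9)
truncated_negative = int(-3.9)
from_whole_string = int('42')
int_to_float = float(7)
string_to_float = float('3.14')

print(truncated_positive)  # 3
print(truncated_negative)  # -3
print(from_whole_string)   # 42
print(int_to_float)        # 7.0
print(string_to_float)     # 3.14

# A string with a decimal point cannot be converted directly to an int.
try:
    int('3.5')
except ValueError as error:
    print('ValueError:', error)
```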

*In the empty code cell below, convert some data types around, and use the type() function to see how the data value changes.*

In [None]:
#How does the id change? Test out the id() function on your altered variables.

<h3>Creating your own functions</h3>

One of the key advantages to having functions is that they allow you to repeat actions or calculations without having to retype the entire underlying code. All you need to do is define the code once, and then you can use it repeatedly afterwards. This also has the added benefit of keeping your code tidier.

To create a function, you will use the keyword <mark style="background-color: #EEEEEE;"><strong>def</strong></mark>, which introduces a function <strong>definition</strong>. The <mark style="background-color: #EEEEEE;"><strong>def</strong></mark> keyword must be followed by the name of the function and its list of named parameters and any default arguments. And we'll make use of the keyword <mark style="background-color: #EEEEEE;"><strong>return</strong></mark>, which will spit out an output if we provide it with a variable.
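
As a rough sketch of this syntax, the hypothetical function below takes three parameters, one of which has a default argument, and returns their average. The function name and calculation are made up for illustration and are not the functions defined in class.

```python
def mean_of_three(a, b, c=0):
    # The indented lines are the body of the function.
    result = (a + b + c) / 3
    return result

# Call the function with three arguments.
average = mean_of_three(3, 6, 9)
print(average)  # 6.0

# If c is left out, its default argument of 0 is used.
average_with_default = mean_of_three(3, 6)
print(average_with_default)  # 3.0
```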

In [None]:
#We'll create a function called rice that will need the parameters a, b, c.

In [None]:
#You can also include other functions within a function that you're creating.

<h3>Variable scope</h3>

You may have noticed that blocks of code defining both functions above use variables. In both cases, we have created within each function the variable name <mark style="background-color: #EEEEEE;"><strong>result</strong></mark>, which only exists within its respective function and does not exist outside of it. This is the variable's **scope**, which determines which parts of the notebook can see and use a variable. 

The consequence of this is that variable names within functions are local and do not have a relationship to variable names in another function, even in the same notebook. The scope of the variable name is visible only to its respective function; therefore, different functions can use the same variable name.
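
The sketch below illustrates this with two hypothetical functions that each use their own **result** variable. It assumes that **result** was never assigned outside of a function in your notebook; the **try**/**except** statement is used only so the cell keeps running after the error.

```python
def double(value):
    result = value * 2  # this result exists only inside double()
    return result

def triple(value):
    result = value * 3  # a separate, unrelated result
    return result

doubled = double(5)
tripled = triple(5)
print(doubled)  # 10
print(tripled)  # 15

# Outside of the functions, the name result is not defined.
try:
    print(result)
except NameError as error:
    print('NameError:', error)
```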

In [None]:
#Let's take a look at whether or not the variable result exists outside of the def block.

<h3>Functions from imported modules or packages</h3>

A **module** is a single Python file, while a **package** is a collection (or directory) of modules. 

Modules and packages in Python allow you to import defined functions into your notebook, so you don't have to redefine them every time you start up a new notebook. All you need to do is import the modules containing the functions you will use at the start of your notebook, and then you can call those functions when needed later on. 

To import a module or package, you can make use of the <mark style="background-color: #EEEEEE;"><strong>import</strong></mark> keyword. For example, let's import a fundamental package used in data analysis called NumPy, which contains a number of functions that are used for analysis and visualization. 

For example:
```
import numpy
```
This will import the NumPy package, but you can also assign a name to the package when you import it. That way, it's easier to call up later on.

For example:
```
import numpy as np
```
This will import the NumPy package under the name <mark style="background-color: #EEEEEE;"><strong>np</strong></mark>. 

After importing NumPy, if we wanted to use NumPy's square root function, we can input the following:
```
np.sqrt(x)
```

If we did not assign the name <mark style="background-color: #EEEEEE;"><strong>np</strong></mark> to NumPy, then to call up NumPy's square root function:
```
numpy.sqrt(x)
```
Note that if you did assign a name to NumPy, then the name <mark style="background-color: #EEEEEE;"><strong>numpy</strong></mark> is not defined, so calling the function through it will not work.
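
The same import pattern applies to any module. The sketch below uses Python's standard-library **math** module rather than NumPy, purely so that it runs even where NumPy is not installed; the short name **m** is an arbitrary choice for illustration.

```python
# Import the math module under the shorter name m.
import math as m

root = m.sqrt(16)
print(root)  # 4.0

# In a fresh session, the name math itself was never defined,
# so calling the function through it raises a NameError.
try:
    math.sqrt(16)
except NameError as error:
    print('NameError:', error)
```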

In [None]:
#Let's import the NumPy package.

In [None]:
#Now you can make use of the functions in NumPy

In [None]:
#You can perform functions on variables you have assigned.

In [None]:
#You can also assign the output of your function to a variable.

<h2 style="font-size: 32px;">Boolean values</h2>

**Boolean** values are one of two values: <mark style="background-color: #EEEEEE;"><strong>True</strong></mark> or <mark style="background-color: #EEEEEE;"><strong>False</strong></mark>, and this data type is used to represent the truth values of expressions, such as conditional statements.

Boolean values are a subtype of integer, where <mark style="background-color: #EEEEEE;"><strong>True</strong></mark> is equivalent to <mark style="background-color: #EEEEEE;"><strong>1</strong></mark> and <mark style="background-color: #EEEEEE;"><strong>False</strong></mark> is equivalent to <mark style="background-color: #EEEEEE;"><strong>0</strong></mark>.

We can evaluate variables and values by using conditional statements that compare values or variables to each other, and the resulting output is a Boolean value. 
```
time = 35
space = 40
```
So we have specific values assigned to two variables <mark style="background-color: #EEEEEE;"><strong>time</strong></mark> and <mark style="background-color: #EEEEEE;"><strong>space</strong></mark>. We can make use of different comparison operators, such as <mark style="background-color: #EEEEEE;"><strong>&equals;&equals;</strong></mark>, which checks if the two values are equal to each other.
```
time == space
```
We can do the same with other comparison operators:
<ul>
    <li><mark style="background-color: #EEEEEE;"><strong>&excl;&equals;</strong></mark> - ...not equal to...</li>
    <li><mark style="background-color: #EEEEEE;"><strong>&gt;</strong></mark> - ...greater than...</li>
    <li><mark style="background-color: #EEEEEE;"><strong>&lt;</strong></mark> - ...less than...</li>
    <li><mark style="background-color: #EEEEEE;"><strong>&gt;&equals;</strong></mark> - ...greater than or equal to...</li>
    <li><mark style="background-color: #EEEEEE;"><strong>&lt;&equals;</strong></mark> - ...less than or equal to...</li>
</ul>
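
Using the values of **time** and **space** from above, the comparisons can be sketched as follows. The last line is an added illustration of Boolean values behaving like the integers 1 and 0.

```python
time = 35
space = 40

print(time == space)  # False
print(time != space)  # True
print(time > space)   # False
print(time < space)   # True
print(time >= 35)     # True
print(space <= 35)    # False

# Booleans behave like the integers 1 and 0 in arithmetic.
true_count = True + True
print(true_count)     # 2
```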

One example where Boolean values will show up is in image analysis. When you set a threshold, you are creating a conditional statement that is applied to the information contained in your image, and it outputs a <mark style="background-color: #EEEEEE;"><strong>True</strong></mark> or <mark style="background-color: #EEEEEE;"><strong>False</strong></mark> result for each pixel.

In [None]:
#Play around with Boolean operators to see how they work and the Boolean outputs

You can also define how code blocks execute based on the Boolean output of a function, allowing you to control how you want a function to operate.
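
A minimal sketch of this idea, loosely based on the thresholding example above. The function name, the default threshold of 128, and the labels are all hypothetical.

```python
def classify_pixel(intensity, threshold=128):
    # The comparison produces a Boolean, which decides which block runs.
    if intensity > threshold:
        return 'signal'
    else:
        return 'background'

bright = classify_pixel(200)
dim = classify_pixel(50)
print(bright)  # signal
print(dim)     # background
```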

In [None]:
#First define a function:

<h2 style="font-size: 32px;">Compound data types</h2>

There are multiple ways in which Python stores compound data types. You have mutable compound data types, such as lists, arrays, dictionaries, and sets, and you have immutable compound data types, such as tuples. 

Much like with single-value data types, it's important to know which compound data type you are working with because Python handles each one differently, and some functions that you'll use for data analysis and bioinformatics may only accept a specific data type.

<h3>Lists</h3>

In Python, **lists** are a type of data structure that can be created and manipulated for data analysis and modeling. The objects within a list are referred to as elements, and they can be heterogeneous. So a list can contain an **int**, a **float**, and a **str**.

<strong>Creating a list</strong>

To create a list, you can denote a list with square brackets <mark style="background-color: #EEEEEE;"><strong>[...]</strong></mark> and use commas to separate out each element of your list.
```
[1, 5, "potato", 3.1415]
```
You can see that this line will generate a one-dimensional list containing 2 **ints**, 1 **str**, and 1 **float** data types, and the list itself is considered an object as well.

In [None]:
#Create a list below, and then output the data type of the list.

Like with other objects, you can assign a list to a variable.

In [None]:
#Assign a list to a variable then output the data type.

<strong>Accessing elements in a list</strong>

Lists are indexed with the very first element at position <mark style="background-color: #EEEEEE;"><strong>[0]</strong></mark>.

So you can use the index to access the elements of an existing list by calling it using the following syntax.
```
potato_list[0]
```
In this case, you are telling Python that you want to get the element at position <mark style="background-color: #EEEEEE;"><strong>[0]</strong></mark> (index value of 0) in the list <mark style="background-color: #EEEEEE;"><strong>potato_list</strong></mark>.

So if you wanted to pull out the 3rd element of <mark style="background-color: #EEEEEE;"><strong>potato_list</strong></mark>:
```
potato_list[2]
```
Recall that because the first element is indexed as position 0, the third element will be indexed as 2.

If you want to access the last element of a list, and you don't know the final position, you can use the index -1.
```
potato_list[-1]
```

To pull out not just one element, but multiple elements (a slice) from a list, you can use slice notation <mark style="background-color: #EEEEEE;"><strong>[start:end:step size]</strong></mark>. Note that the <mark style="background-color: #EEEEEE;"><strong>end</strong></mark> position is exclusive: it is one more than the position of the last element you want Python to return. If you leave out the step size, Python uses a step of 1.
```
potato_list[0:3]
```
This returns the elements from position 0 to 2 (i.e. the first three elements in <mark style="background-color: #EEEEEE;"><strong>potato_list</strong></mark>).


If we also include a step size:
```
potato_list[0:6:2]
```
This returns a list of every other element (because the step size is 2) for elements in position 0 to 5 (one less than 6).

An alternative syntax is to use the <mark style="background-color: #EEEEEE;"><strong>slice()</strong></mark> function. It takes the same arguments as <mark style="background-color: #EEEEEE;"><strong>[start:end:step size]</strong></mark>, but you separate them with commas instead of colons.
```
potato_list[slice(0, 6, 2)]
```
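
As a minimal sketch of the indexing and slicing rules above, the cell below builds a `potato_list` with made-up contents (the values are an assumption for illustration only) and pulls elements from it.

```python
# Hypothetical contents; any list with at least six elements works here.
potato_list = ["russet", "yukon", "fingerling", "red", "purple", "sweet"]

first = potato_list[0]                    # first element (index 0)
third = potato_list[2]                    # third element (index 2)
last = potato_list[-1]                    # last element
first_three = potato_list[0:3]            # positions 0, 1, and 2
every_other = potato_list[0:6:2]          # positions 0, 2, and 4
same_slice = potato_list[slice(0, 6, 2)]  # equivalent to the line above

print(first, third, last)
print(first_three)
print(every_other == same_slice)          # True
```

Note that indexing with a single position returns the element itself, while a slice always returns a new list.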

In [None]:
#Let's pull some elements from our list.

In [None]:
#Then let's determine what data type they are.

<strong>List properties</strong>

If you want to find out how many elements are in a list (i.e. the list's length), you can use the <mark style="background-color: #EEEEEE;"><strong>len()</strong></mark> function.
```
len(potato_list)
```
This will return the length of <mark style="background-color: #EEEEEE;"><strong>potato_list</strong></mark>.

In [None]:
#Try it out in this cell

<h3>Arrays</h3>

**Arrays** are a type of data structure containing a homogeneous data type. Unlike a list, an array cannot store both a **float** and a **str** in the same array. Every element has the same type: all **int**, all **float**, or all **str**.

A key difference under the hood is that an array stores its values in one contiguous block of memory, which makes arrays more memory efficient than lists and faster at retrieving the values they hold.

<strong>Creating an array</strong>

To create an array, you can make use of NumPy's array function <mark style="background-color: #EEEEEE;"><strong>numpy.array()</strong></mark>, which creates an array object called <mark style="background-color: #EEEEEE;"><strong>ndarray</strong></mark>.
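
A minimal sketch of array creation, assuming NumPy is installed. The values are made up, and the import is repeated only so that the cell runs on its own.

```python
import numpy as np

# Hypothetical values for illustration.
int_array = np.array([1, 5, 8, 13])
print(type(int_array))   # the ndarray class
print(int_array.dtype)   # a single dtype shared by every element

# Mixing ints and a float still gives a homogeneous array:
# NumPy converts the ints so that every element is a float.
mixed = np.array([1, 5, 3.1415])
print(mixed.dtype)
```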

In [None]:
#Since we've already imported the NumPy package earlier, we don't need to do it again.
#We can just call up the array function.

Like with lists, you can have Python pull and return specific elements or a slice of your array, and you can do so with the same indexing and slice syntax.

In [None]:
#Play around with it here.

<h3>List vs Arrays</h3>

Some other notable differences between a list and an array are:

<table>
    <tr>
        <th style="background-color: #EEEEEE; border: 1px solid; border-color: #000000;"><strong>List</strong></th>
        <th style="background-color: #EEEEEE; border: 1px solid; border-color: #000000;"><strong>Array</strong></th>
    </tr>
    <tr>
        <td style="background-color: #FFFFFF; border: 1px solid; border-color: #000000;">Heterogeneous data type</td>
        <td style="background-color: #FFFFFF; border: 1px solid; border-color: #000000;">Homogeneous data type</td>
    </tr>
    <tr>
        <td style="background-color: #FFFFFF; border: 1px solid; border-color: #000000;">Created by <mark style="background-color: #EEEEEE;"><strong>[...]</strong></mark></td>
        <td style="background-color: #FFFFFF; border: 1px solid; border-color: #000000;">Created by an array function</td>
    </tr>
    <tr>
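        <td style="background-color: #FFFFFF; border: 1px solid; border-color: #000000;">Math operators do not work element-wise (<mark style="background-color: #EEEEEE;"><strong>+</strong></mark> joins lists, <mark style="background-color: #EEEEEE;"><strong>*</strong></mark> repeats them)</td>
        <td style="background-color: #FFFFFF; border: 1px solid; border-color: #000000;">Math operators work element-wise</td>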
    </tr>
    <tr>
        <td style="background-color: #FFFFFF; border: 1px solid; border-color: #000000;">Can be printed and indexed without a loop</td>
        <td style="background-color: #FFFFFF; border: 1px solid; border-color: #000000;">Can also be printed and indexed without a loop</td>
    </tr>
    <tr>
        <td style="background-color: #FFFFFF; border: 1px solid; border-color: #000000;">Larger memory requirement</td>
        <td style="background-color: #FFFFFF; border: 1px solid; border-color: #000000;">More compact memory size</td>
    </tr>
</table>
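
To make the math-operations row concrete, here is a small sketch, assuming NumPy is installed and using made-up values. The same operators behave differently on a list and on an array.

```python
import numpy as np

values_list = [1, 2, 3]
values_array = np.array([1, 2, 3])

print(values_list * 2)      # repetition: [1, 2, 3, 1, 2, 3]
print(values_list + [10])   # concatenation: [1, 2, 3, 10]
print(values_array * 2)     # element-wise: [2 4 6]
print(values_array + 10)    # element-wise: [11 12 13]
```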

In [None]:
#Compare how Python handles math operations for lists

In [None]:
#Compare that output to this one for arrays

Something interesting about how NumPy arrays work compared to a list is that while the <mark style="background-color: #EEEEEE;"><strong>ndarray</strong></mark> has its own <mark style="background-color: #EEEEEE;"><strong>id</strong></mark>, the elements within it are not stored as separate Python objects. Only when an element needs to be accessed does Python create an object and allocate memory for it. This makes an array more compact in memory than a list, in which each element is its own object with its own allocated memory.

<strong>Modifying lists and arrays</strong>

Lists and arrays are considered mutable data types in Python. This means that their contents can be changed without changing the identity of the object, that is, the <mark style="background-color: #EEEEEE;"><strong>id</strong></mark> that Python uses to refer to it.

In [None]:
#Let's take a look at our list and how it's accessed by Python

The output that you get is a number assigned to your object (<mark style="background-color: #EEEEEE;"><strong>potato_list</strong></mark>), which, if you recall, acts as a memory "address" for Python to access the data referenced by a variable.

If you alter the list by adding elements to it with the <mark style="background-color: #EEEEEE;"><strong>append()</strong></mark> method, the <mark style="background-color: #EEEEEE;"><strong>id</strong></mark> of the list is unchanged, while the elements within the list are updated.

In [None]:
#Let's add some additional elements to potato_list using the list_name.append() function

You can see that the memory address is unchanged even though you have modified the contents of <mark style="background-color: #EEEEEE;"><strong>potato_list</strong></mark>.
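
A minimal sketch of this check, using a made-up `potato_list` (the contents and the appended value are assumptions for illustration):

```python
potato_list = [1, 5, "potato", 3.1415]   # hypothetical contents
id_before = id(potato_list)

potato_list.append("yam")                # modifies the list in place
id_after = id(potato_list)

print(potato_list)
print(id_before == id_after)             # True: same object, new contents
```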

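The same is true for an <mark style="background-color: #EEEEEE;"><strong>ndarray</strong></mark> when you change it in place, for example by assigning a new value to one of its elements. Be aware that <mark style="background-color: #EEEEEE;"><strong>np.append()</strong></mark> does not change an array in place. It returns a new array, so if you assign the result back to the same variable, the variable will refer to a new object with a different <mark style="background-color: #EEEEEE;"><strong>id</strong></mark>.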
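
For an ndarray, one caveat is worth sketching (assuming NumPy is installed; the array contents are made up). Assigning to an element changes the array in place, whereas `np.append()` builds and returns a new array and leaves the original untouched.

```python
import numpy as np

potato_array = np.array([1, 2, 3])          # hypothetical contents
id_start = id(potato_array)

potato_array[0] = 99                        # in-place change to one element
same_after_assignment = id(potato_array) == id_start

appended = np.append(potato_array, [4, 5])  # returns a new array
print(same_after_assignment)                # True
print(appended is potato_array)             # False
print(potato_array)                         # the original is not lengthened
print(appended)
```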

In [None]:
#Let's take a look at the identity of our ndarray.

In [None]:
#Now let's add additional elements using the np.append() function, which returns a new array.

In [None]:
#Then we can see whether the id changes for our ndarray after we assign the result back to the same variable.

<h3>Dictionaries</h3>

**Dictionaries** are another type of compound data structure that allows us to associate a value with a specific key for more efficient access to that value. While we could search through a list to do the same job, dictionary lookups are faster, especially for large collections. In a dictionary, each value is mapped to a unique **key** that can be used to quickly find a specific object (the mapped value). The speed comes from the fact that dictionary keys must be **hashable**, meaning that Python can compute an integer (the hash value) from the key and use it as a shortcut to find the associated value.
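
As a rough illustration of what hashable means, the sketch below uses Python's built-in `hash()` function on a made-up string and then on a list. Immutable objects such as strings can be hashed, while mutable objects such as lists cannot, which is why a list can't be used as a dictionary key.

```python
key = "potato"                       # strings are immutable, so hashable
print(hash(key) == hash("potato"))   # True: equal values share a hash value

try:
    hash([1, 2, 3])                  # lists are mutable, so not hashable
    list_hashable = True
except TypeError:
    list_hashable = False
print(list_hashable)                 # False
```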

In [None]:
#Test it out here, and you can see that an object's id is different from its hash value.

<strong>Creating a dictionary</strong>

To create a dictionary, you will make use of the curly brackets <mark style="background-color: #EEEEEE;"><strong>{...}</strong></mark> or the <mark style="background-color: #EEEEEE;"><strong>dict()</strong></mark> function. You'll pair up a key with a value by following the <mark style="background-color: #EEEEEE;"><strong>key:value</strong></mark> syntax and separating pairs with commas.

In [None]:
#Let's create a simple table with a couple of key:value pairs.

To modify a dictionary, we can use the <mark style="background-color: #EEEEEE;"><strong>update()</strong></mark> method to add key:value pairs, or the <mark style="background-color: #EEEEEE;"><strong>del</strong></mark> statement to delete key:value pairs.

Dictionaries are useful for iterating and counting, where you can iterate through a dataset to get the frequency of some observation.
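
Here is a small sketch of counting with a dictionary. The DNA sequence is made up for illustration, and the variable names are assumptions rather than anything defined elsewhere in this notebook.

```python
# Hypothetical example: count each nucleotide in a short made-up sequence.
sequence = "ATGCGATTAG"

counts = {}
for base in sequence:
    # get() returns the current count, or 0 if the key is not there yet
    counts[base] = counts.get(base, 0) + 1
print(counts)             # {'A': 3, 'T': 3, 'G': 3, 'C': 1}

counts.update({"N": 0})   # add a key:value pair
del counts["N"]           # delete that key:value pair again
print("N" in counts)      # False
```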

<h3>Tuples</h3>

**Tuples** are another compound data type in Python, and they're largely similar to lists. The key difference is that while lists are changeable (mutable), tuples cannot be changed (immutable). This means that while you can modify a list with methods such as <mark style="background-color: #EEEEEE;"><strong>append()</strong></mark>, you cannot do so with a tuple. Tuples are useful when you have a dataset that you don't want to accidentally modify, or when you want to use a group of values as a dictionary key.

Tuples can contain a heterogeneous set of data types, so in that respect, they are similar to a list. The elements stored within a tuple are indexed by integers, like lists and arrays, so you can have Python return specific elements of a tuple.

<strong>Creating a tuple</strong>

The syntax for creating a tuple in Python is with parentheses <mark style="background-color: #EEEEEE;"><strong>(...)</strong></mark> with each element of a tuple separated by a comma. 
```
(1, 4, 5, 19, 25, 60)
```

Since parentheses are also used in Python for grouping, there's an additional consideration you need to keep in mind if you ever want to create a tuple with a single element. You'll need to add a trailing comma after the single element to specify that it is a tuple containing just one element.
```
(1,)
```
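
The sketch below checks the trailing-comma rule and also illustrates the immutability described earlier. The dictionary at the end is a hypothetical example of using a tuple as a key.

```python
not_a_tuple = (1)     # parentheses only group here, so this is an int
one_tuple = (1,)      # the trailing comma makes a one-element tuple
print(type(not_a_tuple), type(one_tuple))

measurements = (1, 4, 5, 19, 25, 60)
try:
    measurements[0] = 2   # item assignment is not allowed on a tuple
    changed = True
except TypeError:
    changed = False
print(changed)            # False

# Hypothetical use as a dictionary key: (chromosome, position) -> count
site_counts = {("chr1", 1024): 3}
print(site_counts[("chr1", 1024)])   # 3
```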

In [None]:
#Let's create a tuple.

In [None]:
#Get Python to return a specific element and slice of our tuple.

#To get a slice, you can follow the usual slice syntax.

In [None]:
#Let's look under the hood. Is each element already assigned an id or is it created only when the object is accessed?

<strong>Packing and unpacking tuples</strong>

Tuple packing occurs when you assign multiple values to a single variable. Python will interpret the assignment to generate a tuple from the different values that you assigned.
```
packed_potato = 1, 'hot potato', 2, 'tired potato', 3
```
The five objects to the right of the <mark style="background-color: #EEEEEE;"><strong>=</strong></mark> operator will be packed together into a single tuple that is assigned to the variable <mark style="background-color: #EEEEEE;"><strong>packed_potato</strong></mark>.

You can unpack a tuple to extract the values stored within it as individual variables:
```
u1, u2, u3, u4, u5 = packed_potato
```
In this case, you have multiple variables on the left side of the <mark style="background-color: #EEEEEE;"><strong>=</strong></mark> operator, which Python will then interpret as unpacking the tuple into its individual objects and assigning each object to each variable. This requires that you have provided as many variables on the left as there are objects in the tuple.
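
A short sketch of packing and unpacking, including what happens when the number of variables doesn't match. The names `a` and `b` are placeholders introduced only for this example.

```python
packed_potato = 1, 'hot potato', 2, 'tired potato', 3
print(type(packed_potato))

u1, u2, u3, u4, u5 = packed_potato   # five variables for five objects
print(u2, u5)                        # hot potato 3

try:
    a, b = packed_potato             # too few variables on the left
    unpacked = True
except ValueError:
    unpacked = False
print(unpacked)                      # False
```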

In [None]:
#Pack a tuple and then unpack the tuple.

<h3>Sets</h3>

The last compound data type that you may encounter is a **set**. Sets can contain either homogeneous or heterogeneous data types. While you can add and remove elements from a set, the elements themselves must be immutable (hashable), so a set can hold a **str** or an **int** but not a list. One notable distinction from the other compound data types is that sets are unordered and unindexed. You can't have duplicate elements in a set, and because sets are unordered, you shouldn't rely on the order in which the elements are displayed. This means that you can't use the usual indexing and slicing syntax to retrieve information from a set. Instead, there is a special group of operators and methods that allow you to operate on and pull elements from a set.

<strong>Creating a set</strong>

Sets are denoted with curly brackets <mark style="background-color: #EEEEEE;"><strong>{...}</strong></mark> like dictionaries, but instead of having key:value pairs, you just have each object separated by commas.
```
{'tomato', 1, 'potato', 3, 'rice', 'wheat', 6}
```
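
A minimal sketch of these set properties, using made-up elements. One detail that is easy to trip over: empty curly brackets create a dictionary, so an empty set has to be created with `set()`.

```python
crops = {'tomato', 'potato', 'rice', 'potato', 'tomato'}
print(len(crops))          # 3: duplicates are dropped
print('rice' in crops)     # True: membership tests still work

empty_set = set()          # an empty set
empty_dict = {}            # empty curly brackets make a dictionary
print(type(empty_set), type(empty_dict))

try:
    crops[0]               # sets are unindexed
    indexable = True
except TypeError:
    indexable = False
print(indexable)           # False
```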

In [None]:
#Create a few sets below with some shared and some not shared values. Assign them to variables.

<strong>Updating a set</strong>

You can add elements to a set with the <mark style="background-color: #EEEEEE;"><strong>add()</strong></mark> method (one element at a time) or the <mark style="background-color: #EEEEEE;"><strong>update()</strong></mark> method (every element of another collection, such as a list or a set), and you can remove an element with the <mark style="background-color: #EEEEEE;"><strong>remove()</strong></mark> method. For <mark style="background-color: #EEEEEE;"><strong>remove()</strong></mark>, you will get an error if what you want to remove is not contained within the set. An alternative is <mark style="background-color: #EEEEEE;"><strong>discard()</strong></mark>, which won't raise an error if the value is not present in the set.
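
A short sketch of updating a set with made-up elements. It also uses `add()`, the set method for adding a single element, alongside the methods named above.

```python
grains = {'rice', 'wheat'}

grains.add('barley')             # add one element
grains.update(['oat', 'rye'])    # add every element of a list
grains.discard('corn')           # not present, but no error is raised

try:
    grains.remove('corn')        # not present, so this raises KeyError
    removed = True
except KeyError:
    removed = False

print(removed)                   # False
print(len(grains))               # 5
```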

In [None]:
#Add additional values to your set.

<strong>Comparing sets</strong>

We can use operators such as:
<ul>
    <li><mark style="background-color: #EEEEEE;"><strong>&vert;</strong></mark> or <mark style="background-color: #EEEEEE;"><strong>union()</strong></mark> to output all elements of both sets without duplication</li>
    <li><mark style="background-color: #EEEEEE;"><strong>&amp;</strong></mark> or <mark style="background-color: #EEEEEE;"><strong>intersection()</strong></mark> to output the shared elements of both sets</li>
    <li><mark style="background-color: #EEEEEE;"><strong>-</strong></mark> or <mark style="background-color: #EEEEEE;"><strong>difference()</strong></mark> to output the elements in the first set that are not in the second</li>
    <li><mark style="background-color: #EEEEEE;"><strong>&Hat;</strong></mark> or <mark style="background-color: #EEEEEE;"><strong>symmetric_difference()</strong></mark> to output the elements that are in either the first set or the second set, but not in both</li>
</ul>
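
The four comparisons above can be sketched with two small made-up sets. The names `set_a` and `set_b` are placeholders for whatever sets you created.

```python
set_a = {'tomato', 'potato', 'rice'}
set_b = {'rice', 'wheat', 'potato'}

union_ab = set_a | set_b          # everything in either set
shared_ab = set_a & set_b         # only what both sets contain
only_a = set_a - set_b            # in set_a but not in set_b
either_not_both = set_a ^ set_b   # in exactly one of the two sets

# Each operator has an equivalent method.
print(union_ab == set_a.union(set_b))                            # True
print(shared_ab == set_a.intersection(set_b))                    # True
print(only_a == set_a.difference(set_b))                         # True
print(either_not_both == set_a.symmetric_difference(set_b))      # True
```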

In [None]:
#Try playing around with these operators and the sets that you created to get a feel for how they work.