<a href="https://colab.research.google.com/github/Nevada-Bioinformatics-Center/python_workshop/blob/main/PythonCrashCourse_Notebook_intermediate.ipynb" target="_parent">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/f8/Python_logo_and_wordmark.svg/2560px-Python_logo_and_wordmark.svg.png" width=300><br>
<font color='#4B8BBE' size=20 >Crash Course  </font>


In this notebook, we'll be learning the fundamentals of Python (all of which apply to programming in general!). We'll be working through a dataset of the most streamed Spotify songs compiled in 2023 <img src="https://upload.wikimedia.org/wikipedia/commons/8/84/Spotify_icon.svg" width=20>.


____
# Learning Objectives 🙋
## At the end of today, you'll be able to:
* Understand the basic components of a Jupyter notebook 📓
* Write and debug basic Python code 🐍
* Use functions, loops, conditional statements, and apply programming fundamentals applicable across computing languages 💻
* Analyze a dataset using Python libraries built for data science 💭 🧠
* Learn and discuss best practices for data visualization to communicate your results 📢 📈

**<font color="red" size=4> IMPORTANT: This notebook is READ-ONLY. To edit and run this notebook, follow the steps in the next section.</font>**

____
## 💾 Before you start - Save this notebook!

When you open a new Colab notebook from WebCampus (as you hopefully did for this one), you cannot save changes. So it's best to store a copy of the notebook in your personal drive `"File > Save a copy in drive..."` **before** you do anything else.

The file will open in a new tab in your web browser, and it is automatically named something like: "**CopyofPythonCrashCourse_Notebook_intermediate.ipynb**". You can rename this to just the title of the assignment "**PythonCrashCourse_Notebook_intermediate.ipynb**". Keep an informative name (like the name of the assignment) so you can easily find the notebook again after you complete this part of the assignment.

**Where does the notebook get saved in Google Drive?**

By default, the notebook will be copied to a folder called “Colab Notebooks” at the root (home) directory of your Google Drive.

___

# Section 1: Setup and import files 🧰 🛠

There are several steps we need to do to access the dataset before we can load the file:
1. Copy the data file to working memory to run in this notebook
2. Check that we have it available to us<br>

Other options (backup if the above doesn't work):
1. Mount our gdrive (connect to this notebook)
2. Fetch the data file from my github repository
3. Change into our directory where the file is located


We will start by downloading the data file (`spotify-2023-clean.csv`) to access for today.

>**Task:** Run the cell below. If it works, you should see text that looks like this:
</br><br>
<img src="https://github.com/tmckim/python-crash-course/blob/main/imgs/wget_filesuccess.png?raw=1"/>


In [None]:
### RUN THIS CELL ###
!wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1hv66NjJSa98UUVsbEppZjtOCInmbbRES' -O spotify-2023-clean.csv

> **Task:** On the left side of the Colab window, click the 📁 icon, and double click on your files to see what the raw data looks like. You should see a file called `spotify-2023-clean.csv`.

## 📚 Python Libraries

Python comes with many useful built-in classes and functions. Lists, dictionaries, counters, and their associated functions are some examples. You can accomplish many simple tasks using native Python code.

</br>
<img src="https://cdn.pixabay.com/photo/2016/03/26/22/21/books-1281581_960_720.jpg" width=600>
</br>

However, just like in many computer languages, the beauty of Python is that other people have spent much time developing useful code that they have tested robustly and wish to share with others. These are called libraries or packages.  We can import popular packages into the Python/Colab environment and if the package has not already been installed on our version of Python, we can easily install the package and use it.

(In Colab, or in a linux environment if you run Python locally, we can easily install new packages using ```!pip install [PYTHON PACKAGE]```)

For this Colab notebook we will be working with four packages that come pre-installed in Colab. They are:

- **numpy**: A library for working with numerical lists, called arrays
- **pandas**: A library for working with two-dimensional tables, called dataframes
- **matplotlib**: A package for plotting, commonly used in tandem with numpy and pandas
- **seaborn**: A package for plotting, commonly used in tandem with pandas

</br><img src="https://miro.medium.com/max/765/1*cyXCE-JcBelTyrK-58w6_Q.png" width=200> <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Pandas_logo.svg/1200px-Pandas_logo.svg.png" width=200> </br><img src="https://matplotlib.org/stable/_static/logo_dark.svg" width=200> <img src="https://seaborn.pydata.org/_images/logo-wide-lightbg.svg" width=200> </br></br>
We can import these libraries (or certain classes/functions from these libraries) into our local runtime using the following code. The nicknames I used (plt, pd, np, and sns) are pretty standard nicknames that most Python programmers use, though you could use anything.


>**Task:** Run the following cell to import the relevant libraries.


In [None]:
### RUN THIS CELL ###
# Import our plotting package from matplotlib and shorten it to plt
import matplotlib.pyplot as plt

# Specify that all plots will happen inline & in high resolution
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# Import pandas for working with dataframes and shorten it to pd
import pandas as pd

# Import numpy and shorten it to np
import numpy as np

# Import plotting package seaborn and shorten to sns
import seaborn as sns

# Print statement to confirm
print('Packages imported!')

>**Task:** Run this cell to set formatting options. These can be changed according to your preferences.

In [None]:
### RUN THIS CELL ###
# We want to see all the columns/rows in the dataset
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.options.display.float_format = '{:.1f}'.format  # don't use scientific notation

## 📫 Importing Data

Read the file from the colab files into your runtime. There are several options for reading a file in, but we will use the `pandas` package (which we imported as `pd`): <br>
*   pandas ```read_table()``` or ```read_csv()``` function, which reads a file with multiple data types into a dataframe.


In [None]:
### RUN THIS CELL ###
## Import file (cleaned up)
spotify = pd.read_csv('spotify-2023-clean.csv')
print('data imported!')
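
A quick way to confirm an import worked is to check the dataframe's shape and first rows. The sketch below uses a small in-memory CSV (made-up values standing in for a real file, with the hypothetical name `demo`) so it runs on its own:

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = "track_name,released_year\nSong A,1999\nSong B,2023\n"

# read_csv accepts a file path or a file-like object such as StringIO
demo = pd.read_csv(io.StringIO(csv_text))

print(demo.shape)   # (number of rows, number of columns)
print(demo.head())  # first rows of the table
```

The same two checks (`.shape` and `.head()`) can be run on `spotify` after the import cell.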

In [None]:
### RUN THIS CELL ###
## This is needed to prepare the dataframe for the workshop today. We ran these commands in the last session.
from datetime import datetime

# Get the current date and time
current_datetime = datetime.now()

# Extract the year attribute
current_year = current_datetime.year

spotify_year = spotify.released_year
age = current_year - spotify_year

spotify.loc[:, "age"] = age
print(spotify.columns)

## 📚 Pre 7.0 Dictionaries

Here we have a new type of built-in data structure, dictionaries, which we will be using more later. <br>
Dictionaries store data in key:value pairs. Also note that we have a new type of brackets or braces that are curly `{ }`

<br>

```
new_type = {'Number of petals': [8, 34, 5], 'Name': ['lotus', 'sunflower', 'rose']}
type(new_type)
dict
```

For example, here are key: value pairs: <br>
`'Number of petals': [8,34,5]` <br>
`'Name': ['lotus','sunflower', 'rose']` <br>
The keys 🔑 are the strings for the column names, and the values 🔟 are the data that go in the rows of the table. <br><br>

Dictionaries are structures which can contain multiple data types. They contain key-value pairs. For each unique key, the dictionary has one value. Keys can be various data types: strings, numbers, or tuples, while the corresponding values can be any Python object.<br>

You cannot access the values of a dictionary by position (as you can with lists or arrays). Instead, you access them by key. Because every lookup depends on the key, dictionaries do not allow duplicate keys.

You can also access just the keys or just the values with the `.keys()` and `.values()` methods.
<br><br>
To find out even more, read the Python [documentation](https://docs.python.org/3/tutorial/datastructures.html#dictionaries). <br>

Other examples and [tutorial](https://www.w3schools.com/python/python_dictionaries.asp)
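
As a minimal sketch of the points above (the `flowers` dictionary and its contents are made up for illustration), the cell below shows lookup by key, what happens when an existing key is assigned again, and the `.keys()` / `.values()` methods:

```python
# Hypothetical example data: flower name -> number of petals
flowers = {"lotus": 8, "sunflower": 34, "rose": 5}

# Look up a value by its key (not by position)
print(flowers["rose"])

# Keys are unique: assigning to an existing key replaces its value
flowers["rose"] = 6
print(len(flowers))  # the number of entries has not changed

# View just the keys or just the values
print(list(flowers.keys()))
print(list(flowers.values()))

# Asking for a missing key with [] raises a KeyError
try:
    flowers["tulip"]
except KeyError as err:
    print("missing key:", err)
```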

### Create and Read Dictionaries

In [None]:
# literal
gene_len = {"AT1G01010": 429, "AT1G01020": 245, "AT1G01030": 358}

# from pairs
pairs = [("AT1G01010", 429), ("AT1G01020", 245)]
d = dict(pairs)

# empty then fill
meta = {}
meta["sample"] = "S1"
meta["species"] = "Arabidopsis"

print(gene_len)
print(d)
print(meta)

### Access Values

In [None]:
print(gene_len["AT1G01010"])     
print(gene_len.get("AT1G99999")) 
print(gene_len.get("AT1G99999", 0))

### Add, Update, Delete, or Merge

In [None]:
print(gene_len)

# add
gene_len["AT1G01040"] = 1910      
print(gene_len)

# update
gene_len["AT1G01020"] = 246       
print(gene_len)

# delete and return value
removed = gene_len.pop("AT1G01030")         
print(gene_len)
print("We removed: ", removed)

# delete without return value
del gene_len["AT1G01040"]         
print(gene_len)



In [None]:
##Merging - (right wins on conflicts)

a = {"A": 1, "B": 2}
b = {"B": 20, "C": 3}
c = a | b   
print(c)
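
The `|` merge operator was added in Python 3.9. As a sketch for older versions (reusing the same small `a` and `b` dictionaries; `merged` and `merged_copy` are made-up names), dictionary unpacking and `.update()` give the same right-wins result:

```python
a = {"A": 1, "B": 2}
b = {"B": 20, "C": 3}

# Unpacking builds a new dict; later entries win on conflicts
merged = {**a, **b}
print(merged)

# .update() changes a dict in place, so copy first to keep `a` intact
merged_copy = a.copy()
merged_copy.update(b)
print(merged_copy)
```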


In [None]:
##Membership and lengths
print("AT1G01010" in gene_len )
print(len(gene_len))           


### Sorting Dictionaries

In [None]:
# add data
gene_len["AT1G02040"] = 1910     
gene_len["AT1G01080"] = 1234    
gene_len["AT1G01050"] = 678
print(gene_len)

In [None]:
# by key
for gid in sorted(gene_len):
    print(gid, gene_len[gid])

In [None]:
# by value (ascending)
for gid, length in sorted(gene_len.items(), key=lambda kv: kv[1]):
    print(gid, length)

In [None]:
# top-N by value
top2 = sorted(gene_len.items(), key=lambda kv: kv[1], reverse=True)[:2]
print(top2)

### Defaults with `setdefault`

In [None]:
# group transcripts by gene prefix (drop isoform like .1/.2)
rows = ["AT1G01010.1", "AT1G01020.2", "AT1G01010.2"]
by_gene = {}
for tid in rows:
    gid = tid.split(".")[0]
    by_gene.setdefault(gid, []).append(tid)

print(by_gene)


### Counting with a dict (and with `Counter`)

In [None]:
counts = {}
hits = ["A","B","A","C","A","B"]
for h in hits:
    counts[h] = counts.get(h, 0) + 1

print(counts)

In [None]:
from collections import Counter
Counter(hits).most_common()   # [('A', 3), ('B', 2), ('C', 1)]


### Dictionary comprehensions

In [None]:
# transform values
kb = {gid: length/1000 for gid, length in gene_len.items()}
print(kb)

# filter+transform
long_genes = {gid: L for gid, L in gene_len.items() if L >= 500}
print(long_genes)


### Nested Dictionaries


In [None]:
samples = {
    "S1": {"reads": 1200000, "gc": 0.41},
    "S2": {"reads": 980000,  "gc": 0.43},
}
print(samples["S1"]["reads"])


## Pre 7.0 Coding Exercise

Build a dictionary lookup for the following list so that `status["S2"] == "FAIL"`


In [None]:
### YOUR CODE HERE ###
rows = [("S1", "OK"), ("S2", "FAIL"), ("S3", "OK")]

In [None]:
#@title Example Solution
rows = [("S1", "OK"), ("S2", "FAIL"), ("S3", "OK")]
status = dict(rows)
print(status)

## Pre 7.1 Coding Exercise

Invert the following dictionary: use `gene_to_symbol` to build a new dictionary, `symbol_to_gene`, in which the keys and values are swapped.

In [None]:
### YOUR CODE HERE ###
gene_to_symbol = {"AT1G01010":"NAC001","AT1G01020":"ARV1"}

In [None]:
#@title Example Solution
gene_to_symbol = {"AT1G01010": "NAC001", "AT1G01020": "ARV1"}
symbol_to_gene = {v: k for k, v in gene_to_symbol.items()}
print(symbol_to_gene)

## Pre 7.2 Coding Exercise

Keep only entries in `gene_len` with length ≥ 300.

In [None]:
### YOUR CODE HERE ###


In [None]:
#@title Example Solution
filtered = {gid: L for gid, L in gene_len.items() if L >= 300}
print(filtered)

## Pre 7.3 Coding Exercise

Print the top 3 longest genes (id and length) from `gene_len`

In [None]:
### YOUR CODE HERE ###


In [None]:
#@title Example Solution
top3 = sorted(gene_len.items(), key=lambda kv: kv[1], reverse=True)[:3]
for gid, L in top3:
    print(gid, L)

## 🧘 7.0 Loops




A **for loop** lets you repeat an action for each item in a collection (a list, string, file lines, etc.).
Python’s for loops iterate over items, not over numeric indexes (unless you ask for indexes).

* `iterable` = something you can loop over (list, tuple, string, range, dict, generator…)
* `item` = each element from the iterable


```python
for item in iterable:
    # do something with item
```


In [None]:
fruits = ["apple", "banana", "cherry"]
for f in fruits:
    print(f.upper())


## Looping with `range()`

Use `range()` to generate sequences of numbers.

In [None]:
for i in range(5):
    print(i)

In [None]:
# start, stop (exclusive), step
for i in range(2, 10, 2):  # 2, 4, 6, 8
    print(i)

## Getting both index and value: `enumerate()`


In [None]:
letters = ["a", "b", "c"]
for idx, val in enumerate(letters):         # default start=0
    print(idx, val)



In [None]:
for idx, val in enumerate(letters, start=1):  # start index at 1
    print(idx, val)

## Looping over multiple lists: `zip()`

If lists differ in length, `zip` stops at the shortest. Use `itertools.zip_longest` to keep going.

In [None]:
genes   = ["AT1G01010", "AT1G01020", "AT1G01030"]
symbols = ["NAC001",    "ARV1",      "NGA3"]

for g, s in zip(genes, symbols):
    print(f"{g}\t{s}")
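
As a rough sketch of the unequal-length case mentioned above (the shortened `symbols` list is made up for illustration), `zip` silently drops the unmatched gene, while `itertools.zip_longest` pads the gap with a fill value:

```python
from itertools import zip_longest

genes = ["AT1G01010", "AT1G01020", "AT1G01030"]
symbols = ["NAC001", "ARV1"]  # one symbol short on purpose

# zip stops at the shortest input
short_pairs = list(zip(genes, symbols))
print(short_pairs)

# zip_longest keeps going and fills the missing entries
all_pairs = list(zip_longest(genes, symbols, fillvalue="NA"))
print(all_pairs)
```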

## Looping over dictionaries

In [None]:
gene_lengths = {"AT1G01010": 429, "AT1G01020": 245, "AT1G01030": 358}

# keys (default)
for gid in gene_lengths:
    print(gid, gene_lengths[gid])



In [None]:
# key/value pairs
for gid, length in gene_lengths.items():
    print(gid, length)

## `break`, `continue`, and `pass`

`break`: stop the loop early

`continue`: skip to the next iteration

`pass`: do nothing (placeholder)

In [None]:
nums = [3, 7, 11, 12, 18]
for n in nums:
    if n % 2 == 0:
        print("first even:", n)
        break                       


In [None]:
for n in range(1, 8):
    if n == 4:
        continue                     
    print(n)
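
The two cells above cover `break` and `continue`. As a small sketch of `pass` (the `odds_seen` counter is a made-up name), it fills a branch that must exist syntactically but should not do anything yet:

```python
nums = [3, 7, 11, 12, 18]
odds_seen = 0

for n in nums:
    if n % 2 == 0:
        pass  # placeholder: nothing to do for even numbers (yet)
    else:
        odds_seen += 1

print("odd numbers:", odds_seen)
```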


## The `for`/`else` pattern (search with a fallback)

`else` runs only if the loop didn’t break.

In [None]:
target = "NGA3"
hits = ["NAC001", "ARV1", "LHY", "NGA3"]

for h in hits:
    if h == target:
        print("found:", h)
        break
else:
    print("not found")  
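
In the cell above the target is found, so the `else` branch is skipped. A sketch of the other case, assuming a target that is not in the list (`"FLC"` is a made-up choice, and `result` is a made-up variable), shows the fallback running:

```python
target = "FLC"  # assumed to be absent from hits
hits = ["NAC001", "ARV1", "LHY", "NGA3"]

result = None
for h in hits:
    if h == target:
        result = "found"
        break
else:
    # reached only because the loop finished without hitting break
    result = "not found"

print(result)
```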


## Building results incrementally (accumulation)

In [None]:
lengths = [429, 245, 358]
total = 0
for L in lengths:
    total += L
print("sum:", total)


## List comprehensions

Everything you can do with a simple `for` that builds a list, you can usually do with a comprehension. Use comprehensions for simple transforms/filters; use full `for` loops for multi-step logic.

In [None]:
# for-loop
squares = []
for n in range(5):
    squares.append(n*n)

print(squares)



In [None]:
# list comprehension (equivalent)
squares = [n*n for n in range(5)]

print(squares)

In [None]:
evens = [n for n in range(10) if n % 2 == 0]
print(evens)

## Coding Exercise 7.0

Review the code below that prints individual letters of the word `music`. <br>
This is a bad (inefficient) approach. It doesn't scale ❌ - <br>
It only prints the first five letters of longer words and will raise an error for shorter words (imagine if you had thousands of rows of data!)<br>

A better approach to print each letter in the word would be to use a for loop.

>**Task:** Write a for loop in the second cell to produce the same output from the first cell.

In [None]:
### RUN THIS CELL ###
word = 'music'
print(word[0])
print(word[1])
print(word[2])
print(word[3])
print(word[4])

In [None]:
### YOUR CODE HERE ###


>**Task:** Now test how flexible your for loop is. Try another word to see it in action!

In [None]:
### YOUR CODE HERE ###

In [None]:
#@title Example Solution
word = 'music'
for char in word:
    print(char)

### **A Note on Naming** <br>
In the above example, we used the name `char` which was short for `character`. This makes sense in our case, but see the example below. You have the freedom to choose any name that you want. In the case below, if we choose `dog` for the loop variable (counter variable), it will work as long as the same name is used when we start the loop and then within the loop. <br><br> However, when writing code, it's best to use a name that is meaningful, otherwise it could be difficult to figure out what the loop is doing. Do your future self a favor and use what makes sense (and leave yourself comments!) 🙃

In [None]:
### RUN THIS CELL ###
word = 'bananas'
for dog in word:
    print(dog)

## 🏃 7.1 Coding Exercise



Use a for loop to convert the string “hello” into a list of letters:
```["h","e","l","l","o"]```
<br>

**Hint:** You can create an empty list like this:
```my_list = []```.

The solution also includes using our friend `.append` for working with lists (we practiced above).


In [None]:
### YOUR CODE HERE ###

In [None]:
#@title Example Solution
my_list = []
my_string = "hello"
for char in my_string:
    my_list.append(char)
print(my_list)

## 🏋 7.2 Coding Exercise



Above we changed the `mode` column of data into a string. We need to do this for 3 other columns in our `spotify` dataset. They are:
- `artist_name`
- `track_name`
- `key`

>**Task:** Edit the skeleton code below to use a for loop to change the datatype from object to string. If you need to reference what we did above, it was the section on 💾 Data Types.

In [None]:
### EDIT THIS CODE TO RUN ###
col_names = [..., ..., ...]

for ... in ...:
  spotify[...] = spotify[...].astype('string')

spotify.info() # use this to verify the cols are strings

In [None]:
#@title Example Solution
col_names = ['artist_name', 'track_name', 'key']

for c in col_names:
  spotify[c] = spotify[c].astype('string')

spotify.info()

# What we previously used for reference:
# spotify['mode'] = spotify['mode'].astype('string')

## ⭐ Bonus: for loops to count data

We know that there are some artists that appear to be dominating this dataset. We can use a for loop to go through and count the songs per each unique artist. This for loop is a bit more complex and uses slightly different syntax than the ones we practiced above.

This is one example of how we can write our own code to review the data manually. In a few sections, we will see we can similarly look at descriptives using methods like `groupby` instead. This is already built in with `pandas` and may be more efficient than writing this out by hand with a for loop. But, this is still good practice as we are learning! For example, you could try to produce this same output using multiple methods to compare.

In [None]:
# For loop to find and print number of songs per artist
nums = []                                                                         # This will allow us to do an overall count too

for i, artist_name in enumerate(spotify.artist_name.unique()):                    # we go through one-by-one each unique artist name - remember, we can have up to 7 together!
  artist_data = spotify[spotify.artist_name == artist_name]                       # we find all rows in our dataframe that are for the artist(s)
  artist_count = len(artist_data)                                                 # we use `len` (short for length) to get the # of rows of data (each row in dataset is a song)
  nums.append(artist_count)                                                       # we created an empty list called nums before our loop, and we add the count for each artist to it
  print(f'# of songs for {artist_name}: ' + str(artist_count))                    # this prints out text to tell us how many songs were counted for each artist

print('Total # of songs in dataset: ' + str(sum(nums)))                           # this is to just double check we get out the number we expect!
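
One way to try the comparison suggested above, sketched on a made-up dataframe (`toy_artists` and its values are illustrative, not taken from the Spotify file): count with a loop, then check the result against the `value_counts()` method from `pandas`.

```python
import pandas as pd

toy_artists = pd.DataFrame({"artist_name": ["A", "B", "A", "C", "A"]})

# Manual count with a for loop
loop_counts = {}
for name in toy_artists["artist_name"]:
    loop_counts[name] = loop_counts.get(name, 0) + 1
print(loop_counts)

# Built-in alternative: one count per unique value
builtin_counts = toy_artists["artist_name"].value_counts().to_dict()
print(builtin_counts)

# Both approaches should agree
print(loop_counts == builtin_counts)
```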



---


# Section 8: Conditional Statements 🔀

We've started working with the dataset, but so far you probably feel like you know very little about all of the data it contains. That's okay- we have tools we can use to write **robust** code to find information in our dataframe when we don't necessarily know specific information yet at the start.
To do so, we can write code that uses the following logic:

 <font color='green'>If a certain condition is `True`, <font color='blue'>execute the code one way. </br><font color='orange'>If that condition is not true (`False`), <font color='magenta'>execute the code a different way.</font>

We can implement this logic using a **conditional statement**. The syntax for a conditional statement in Python looks like the following:

```
if CONDITION:
  DO SOMETHING
elif ANOTHER_CONDITION:
  DO ANOTHER THING
else:
  DO SOMETHING ELSE
```


*   **``if``**: a keyword that lets Python know you are implementing a conditional.

*   **CONDITION(AL)**: A boolean function or variable that returns the value ```True``` or ```False```. The word **boolean** means exactly that - a function or variable can only return True or False.

*   **DO SOMETHING**: The body of code you wish to execute if the CONDITION is True.

*   **``elif``** (optional): A keyword that lets Python know that it should check ANOTHER_CONDITION, but only if the previous condition was not true.

*   **DO ANOTHER THING** (optional): The body of code you wish to execute if ANOTHER_CONDITION is True.

*   **``else``** (optional): A keyword that lets Python know that it should execute the following body of code if none of the conditions above are true.

*   **DO SOMETHING ELSE** (optional): The body of code you wish to execute if both CONDITION and ANOTHER_CONDITION are False.
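
As a minimal sketch of the full `if`/`elif`/`else` pattern described above (the `tempo` value and the cutoffs are made up for illustration), Python checks each condition from top to bottom and runs only the first branch whose condition is `True`:

```python
tempo = 128  # hypothetical beats per minute

if tempo < 90:
    label = "slow"
elif tempo < 140:
    label = "medium"
else:
    label = "fast"

print(label)
```

Try changing `tempo` to see each of the three branches run.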



We will explore how to use each of these different components of an if-then statement in the following exercises.


We already have some built in functions in `pandas` that return a **boolean** depending on the condition you are looking for in the dataset.

[**```isna()```**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html): Returns `True` where a value is missing (NaN), and `False` where it is not.

>**Task:** What will the following block of code print? `True` or `False`? Run the code below to see if you are correct.

In [None]:
### RUN THIS CELL ###
pd.isna('Python Crash Course')

In [None]:
### RUN THIS CELL ###
pd.isna(np.nan)

In [None]:
### RUN THIS CELL ###
# Want to know what np.nan does? Try running just that part here
np.nan

Here is a more interesting example to examine any (and all) missing data from our dataframe.

In [None]:
### RUN THIS CELL ###
spotify.isna().sum()

Why did `sum` work with `True` and `False` values? <br>When adding them up, Python treats `True` as 1 and `False` as 0. So the above `sum` is really a count of how many `True` (`nan`) values were found in each column. <br><br>
Note: Pandas supports **missing values**. If a value is missing it will be a "nan" value. Many functions (```mean```, ```sum```, etc.) have specific parameters you can choose for how to deal with missing values (do we ignore it, do we treat it as a zero, do we return a null?). This is a decision you will need to make based on your question of interest or purpose for running your code/analyses.
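
A small sketch of both points, using a made-up Series (`streams`) rather than the Spotify data: booleans add up like 1s and 0s, and the `skipna` parameter controls whether `mean` ignores missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value
streams = pd.Series([10.0, np.nan, 30.0])

# True counts as 1 and False as 0, so the sum is a count of NaNs
n_missing = streams.isna().sum()
print(n_missing)

# By default, missing values are skipped
mean_skip = streams.mean()
print(mean_skip)

# With skipna=False, any missing value makes the result NaN
mean_keep = streams.mean(skipna=False)
print(mean_keep)
```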



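We can also use comparison operators in our conditional statements, and combine several comparisons with `and` / `or`:
* "is equal to" `==`
* "is not equal to" `!=`
* "is less than" `<`
* "is greater than" `>`
* "is less than or equal to" `<=`
* "is greater than or equal to" `>=`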

>**Task:** Predict what Python will do with the following logical operator. Then run the cell to check your predictions. Try changing the value of ```age``` and ```day``` and see what happens.

In [None]:
age = 65
day = "Tuesday" # Tuesdays are senior nights!
if (age >=65) and (day == 'Tuesday'):
  print("You get a senior discount!")
else:
  print("Sorry, you do not qualify for a senior discount.")

## ⛔ Using Conditional Statements to Select Dataframe rows

We learned above how to pull out data from an entire column (or columns). Now, we want to pull out data from rows (across multiple columns) that meets a certain condition. The logic is similar, but we need to add a bit more to our code.<br> <br>
Remember that our data is similar within a column, but each row contains a mix of data across the columns. For example, some values are text (strings) and some are numbers. <br><br>
To filter a DataFrame based on the contents by rows in `pandas`, we need to use boolean (`True` or `False`) expressions in order to select the rows we want to keep. Conditional expressions (`>`,`<`, `==`, `!=`, `<=`, ...) enable us to test whether our conditions of interest would evaluate to `True` or `False`. Typically, we want to keep data in our rows that meets a certain criteria (`True`). Let's walk through an example.

>**Task:** Run this code. Predict what you think you will see once it runs.

In [None]:
### RUN THIS CELL ###
spotify['in_apple_playlists'] > 50

We put in the name of our column `in_apple_playlists` and asked Python to find where this value is greater than 50.<br>
You can see from the output that sometimes values in this column are `True`, meaning the song in that row was in more than 50 Apple playlists. If you see `False`, this indicates that the value in the row did *NOT* meet our condition (greater than 50).<br>
This is helpful to illustrate the values we would see if we *only* look at the column we specified `in_apple_playlists`.

We want to take this a step further, because we don't want to work with just one column in our dataset! It's important to see what Python is doing and how it interprets the code, since this is an intermediate step in the code we will run below.<br>
We want to use the code we've written above but now apply this filter across the entire dataset. So, we want to pull out **all rows** where data was in more than 50 apple playlists, and also keep all columns in our dataset. <br><br>
To do this, we have to specify that we want to do this filtering in the context of our dataframe `spotify`. The line of code below will reference the dataframe first and then put the code we ran above inside brackets to subset the entire dataset. We will get out only rows that meet our condition `>50 apple playlists` and also still have all of the columns in our dataframe to work with.

>**Task:** Review this code and compare it to above. Also review the output.

In [None]:
### RUN THIS CELL ###
apple_subset = spotify[spotify['in_apple_playlists'] > 50]
apple_subset

In [None]:
### RUN THIS CELL ###
# How many rows met this condition?
print(len(apple_subset))
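
The two steps above can be sketched on a tiny made-up dataframe (the `toy` name and its values are illustrative, not from the Spotify file): first build the `True`/`False` mask, then place it inside the brackets to keep only the matching rows.

```python
import pandas as pd

toy = pd.DataFrame({
    "track_name": ["Song A", "Song B", "Song C"],
    "in_apple_playlists": [12, 75, 51],
})

# Step 1: a boolean Series, one True/False per row
mask = toy["in_apple_playlists"] > 50
print(mask.tolist())

# Step 2: use the mask to keep rows where it is True (all columns stay)
toy_subset = toy[mask]
print(toy_subset)
```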

## 🧘 8.0 Coding Exercise

Choose your own adventure! 🌄
Pick something you would like to filter the dataset by and test out your code. If you need a reminder of all the columns, run the cells below to preview some more info about the data. Alternatively, type the name of the dataset so you can scroll through it to pick something you are interested in.

If all else fails and you aren't feeling creative (or are stuck and don't know where to start), use the code above and adjust it so that you find the rows of data for Queen (`artist_name`).

In [None]:
### YOUR CODE HERE ###

In [None]:
#@title Example Solution
# try filtering by an artist name
spotify[spotify['artist_name'] == 'Queen']

## 🏃 8.1 Coding Exercise


Using a for loop and an if statement, in combination with the other concepts you have learned, build some code to count the number of songs in the dataframe that are over 20 years old.



In [None]:
### YOUR CODE HERE ###

In [None]:
#@title Example Solution
# Counters start at zero and go up by one for each matching song
songs_over_20 = 0
songs_20_or_under = 0

for i in spotify.index:
  if spotify.age[i] > 20:        # "over 20" means strictly greater than 20
    songs_over_20 += 1
  else:
    songs_20_or_under += 1

print('# of songs over 20:', songs_over_20)
print('# of songs 20 or under:', songs_20_or_under)
print('Check of total # of songs:', songs_over_20 + songs_20_or_under)


---


# Section 9: Reviewing the Dataset 👀

So far we've learned how to use Python and worked with parts of the dataset. Maybe you feel like you've started to get to know the dataset, but there's a lot of data! Let's take stock of the data we have available to us. I will demonstrate some of my most commonly used functions and methods. These will help us see what data we have and how to find the information we need to describe it.

>**Task:** Let's start by reviewing the dataframe like we already have above. We will use `.info()` to remind ourselves what info we have. Run this cell.

In [None]:
### RUN THIS CELL ###
spotify.info()

## `Unique` values 🦄

Luckily, we don't have to do everything by eye 👀 and scroll the whole way through the table (although I do recommend doing this to get a feel for the data!). <br><br>
We can use the function [`unique`](https://pandas.pydata.org/docs/reference/api/pandas.unique.html) in `pandas` to help us out. It will return the unique values from the column we specify based on the order of appearance (it does not sort them in any way).<br> <br>

Helpful info can also be found [here](https://favtutor.com/blogs/pandas-unique-values-in-column). <br>

Below, we will first call the name of our dataframe (`spotify`) then reference the column name (`.column_name`) and finally use `.unique()` to figure out the possible values in our column. <br><br>
The code we will use looks like this: ```dataframe.column_name.unique()```
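
As a quick sketch of the order-of-appearance behavior (using a made-up Series named `toy_counts`, not the Spotify data):

```python
import pandas as pd

# Hypothetical column of values with repeats
toy_counts = pd.Series([2, 1, 2, 3, 1])

# Unique values come back in the order they first appear, unsorted
unique_counts = toy_counts.unique()
print(unique_counts)
```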

## 🧘 9.0 Coding Exercise

Until now, we don't actually know what all the options are for `artist_count` in the dataset - did you manually count above 😉?

>**Task:** Use the example syntax above to write code that will show the `unique` numbers contained in the `artist_count` column. This will give us more info about artist collaborations in the dataset.

In [None]:
### YOUR CODE HERE ###

In [None]:
#@title Example Solution
print('Counts of artists:', spotify.artist_count.unique())

## Sorting 🔽

As you can see, this isn't sorted. The numbers are listed in the order they first appeared in the dataset. It often helps to arrange our data in a logical way that makes sense for our brain 🧠. Sorting is particularly helpful for this!

If you want to `sort` them, can you find a way to do so?

**Hint:** There are many ways to do so! If you are ever stuck, you can try googling the combination of the package you are using ([`numpy`](https://numpy.org/doc/stable/reference/generated/numpy.sort.html)) and what you'd like to do with the data (`sort`).

In [None]:
### YOUR CODE HERE ###

In [None]:
#@title Example Solution
print('Counts of artists using np.sort:', np.sort(spotify.artist_count.unique()))

In [None]:
#@title Example Solution- Another Way
print('Counts of artists using sorted:', sorted(spotify.artist_count.unique()))

## 9.1 🏃 Coding Exercise

It's also possible to sort the entire dataframe. `pandas` uses [`sort_values`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) to arrange the dataframe depending on what you'd like to sort by.

Let's take a look at the dataframe sorted by `released_year`. Just how old are the oldest songs in our dataset?

>**Task:** Use the skeleton code here and add it to a cell below. Sort based on `released_year`. <br>
`sorted_df = dataframe.sort_values(by=['...'])`

Note: `sort_values` sorts in ascending order by default (`ascending=True`). To sort from most recent to oldest instead, add the parameter `ascending=False`.

In [None]:
### YOUR CODE HERE ###

In [None]:
#@title Example Solution
# Surprising insight: Christmas songs in our list?
sorted_year = spotify.sort_values(by=['released_year']) # default is ascending = True
sorted_year

Interesting! Just scrolling through the top of the list, there are some Christmas classic songs in there that still seem to be widely streamed! ❄ 🎄 ⛄
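
As a sketch of the `ascending` parameter noted above (the `toy_years` dataframe is made up for illustration), setting `ascending=False` puts the most recent year first:

```python
import pandas as pd

toy_years = pd.DataFrame({
    "track_name": ["Song A", "Song B", "Song C"],
    "released_year": [1999, 2023, 1957],
})

# Sort from most recent to oldest
newest_first = toy_years.sort_values(by=["released_year"], ascending=False)
print(newest_first)
```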

## Descriptive Statistics 🔎



## 🏋 9.2 Coding Exercise
We will now examine descriptive statistics of our different variables. We are typically interested in different parameters, such as: counts, means, and standard deviation. This will tell us more about our sample -- it will describe the data we have available to work with.<br>

There is a built in method to do just this with `pandas` called [`describe`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html). <br>

The code looks like this: ```dataframe.describe()``` <br>

>**Task:** Write code below using our dataframe name to describe the data.

In [None]:
### YOUR CODE HERE ###

In [None]:
#@title Example Solution
spotify.describe()

Notice that by default, the information is calculated for only the columns that contain **numerical** data. It is possible to specify additional parameters for the `describe` method that will ask for **categorical** info to be displayed as well.

>**Task:** Run the cell below to see the output that includes `all` datatypes. Notice the NaNs wherever a statistic doesn't apply to a column's datatype (for example, `unique` and `top` for numerical columns, or `mean` for text columns).

In [None]:
### RUN THIS CELL ###
spotify.describe(include = 'all')
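You can also ask `describe` for only the non-numerical columns. The sketch below uses a small made-up dataframe and the `exclude` parameter; the data is hypothetical and only there to keep the output short:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one text column and one numerical column
toy = pd.DataFrame({
    'mode': ['Major', 'Minor', 'Major'],
    'bpm': [120, 90, 150],
})

numeric_only = toy.describe()                   # default: numerical columns only
text_only = toy.describe(exclude=[np.number])   # everything except numbers

print(numeric_only)
print(text_only)
```

For text columns, `describe` reports `count`, `unique`, `top` (the most common value), and `freq` (how often it appears) instead of means and quartiles.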

## Groupby  👭

We could instead create our own descriptive table based on a subset of information we are interested in.
Grouping allows us to arrange our data in the way we want to review it. You can think of grouping as splitting the dataset into buckets 🪣. Then you can call "aggregate" functions (`mean`, `sum`, `max`, `min`, etc.) on these buckets to find these values per bucket (which can lead to interesting analysis 📊)!

We will use [`groupby`](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html) in `pandas` to help us out. <br>

A group by operation consists of two parts:
1. We use ```.groupby``` to group rows together that have the same value for the column(s) we are interested in. Typically, you group by categorical columns rather than numerical columns. E.g., if we are interested in analyzing the major versus minor mode of songs, we might do: ```spotify.groupby('mode')```

2. An aggregation function. This is the computation you want to perform across the sets of rows that were grouped together during step 1. ```.count()```, ```.mean()```, and ```.sum()``` are common aggregation functions, but you can also pass your own as an argument to ```.aggregate()```.
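The two parts above, including a custom aggregation function, can be sketched on a tiny made-up dataframe. The helper `value_range` is a hypothetical function written just for this illustration:

```python
import pandas as pd

# Hypothetical data; column names mirror the Spotify dataframe
toy = pd.DataFrame({
    'mode': ['Major', 'Minor', 'Major', 'Minor'],
    'bpm': [120, 90, 150, 100],
})

def value_range(values):
    """Hypothetical helper: spread between the largest and smallest value."""
    return values.max() - values.min()

# Part 1: group rows by 'mode'. Part 2: aggregate 'bpm' within each group.
bpm_by_mode = toy.groupby('mode')['bpm'].agg(['mean', value_range])
print(bpm_by_mode)
```

Built-in aggregations are passed by name as strings (`'mean'`), while your own function is passed as the function object itself; its name becomes the column label.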


Here are examples of using `groupby` to examine our data further 🔬

In [None]:
# Calculate summary statistics
print("Summary statistics:")

# group and aggregate the values
summary_stats = spotify.groupby(by=['artist_name']).agg(
    {'artist_name': 'count', 'bpm': ['mean', 'std']}).round(2)

# display
display(summary_stats)

Reviewing the above code:

In this example, we wanted to see the **count** of `artist_name` and the **mean** and **standard deviation** of `bpm` for each artist (`artist_name`).<br>

What we did with the code was:

1. `groupby` one column name. We only used one column above, but you can add more than one.
2. After we use `groupby`, we apply the aggregation method `.agg()` and specify a dictionary 📙 in the following way: `{column_name: aggregation function}`. We applied multiple functions at once on different columns.

**Notice**:  We also applied multiple functions on the same column `bpm` by specifying a list that contained multiple functions in brackets like `["mean", "std"]`.
Also, we have NaNs in our standard deviation column. This is why I also included the count column. It makes sense that if we only have 1 song (1 row of data), then we cannot calculate any deviation. This is a good way to check that the values you get from your data make sense!
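As an alternative to the dictionary form, `pandas` also supports "named aggregation", where each output column gets a name of your choosing. This is a sketch on made-up data (the output names `n_songs`, `bpm_mean`, and `bpm_std` are our own choices, not part of the dataset), and it also reproduces the NaN behavior for an artist with a single song:

```python
import pandas as pd

# Hypothetical data: artist X has two songs, artist Y has only one
toy = pd.DataFrame({
    'artist_name': ['X', 'X', 'Y'],
    'bpm': [100, 120, 90],
})

# Each keyword is an output column: name=(input column, aggregation)
summary = toy.groupby('artist_name').agg(
    n_songs=('bpm', 'count'),
    bpm_mean=('bpm', 'mean'),
    bpm_std=('bpm', 'std'),
).round(2)

print(summary)
```

The result has flat, single-level column names, which can be easier to reference later than the two-level headers produced by the dictionary form.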


## 🏋 9.3 Coding Exercise

Now it's your turn to practice using `groupby`. You can start by copying and pasting the code from above. Here we will `groupby` `artist_name` again and keep the `count` aggregation (same as above), but instead of the `bpm` statistics we will find the `min` and `max` of `released_year`. Adjust your code and test it out to review which artists have been releasing songs across multiple years, or to determine if some were one hit wonders! 🎯

In [None]:
### YOUR CODE HERE ###

In [None]:
#@title Example Solution
# Calculate summary statistics
print("Summary statistics:")

# group and aggregate the values
summary_stats = spotify.groupby(by=['artist_name']).agg(
    {'artist_name': 'count', 'released_year': ['min', 'max']})

# display
display(summary_stats)

-----

# Section 10: Plotting with Seaborn (and Matplotlib) 📈 📊

<img src="https://upload.wikimedia.org/wikipedia/commons/8/87/Beauty_in_magic_square.png" width=1000>

[Seaborn](https://seaborn.pydata.org/) is a library for data visualization and figure making. Seaborn works very well with `pandas` dataframes, and because it is built on top of `matplotlib`, you can modify the appearance of your plots using `matplotlib` settings as well. <br> In `seaborn`, we:<br>
(1) decide what plotting function to use, <br>
(2) pass the function a **full dataframe**, <br>
(3) specify which columns in our dataframe we want to use as our x-axis, y-axis, colors, etc. and <br>
(4) specify any additional parameters specific to the plotting function. <br>


```
sns.PLOT_FUNCTION(
    data=YOUR_DATA,
    x='COL TO USE FOR X',
    y='COL TO USE FOR Y',
    hue = 'COL YOU WANT TO SET COLOR ACCORDING TO',
    additional params=...)
```


>**Task:** Review and run the cells below to see examples.


Let's create a plot to examine whether there is a relationship between `bpm` and `energy_percent`. Does this vary by `mode`?

In [None]:
### RUN THIS CELL ###

# Set this text for the plot- easiest to edit here as needed
x_label_name = 'Beats per minute (bpm)'                                       # x-axis label text
y_label_name = 'Energy %'                                                     # y-axis label text
leg_title = 'Mode'                                                            # legend title - column name for hue
plot_title = 'Relationship between BPM and Energy % by Mode'                  # plot title text
fig_name_scatter1 = "Scatter_BPM_Energy_Mode.png"                             # text to name figure for saving

# Example of a scatter plot
sp = sns.scatterplot(                                                         # plotting option to use in seaborn
    data=spotify,                                                             # dataframe
    x='bpm',                                                                  # column name to use values for x-axis
    y='energy_percent',                                                       # column name to use values for y-axis
    hue='mode')                                                               # color code our dots by mode column values

# Adjustment + aesthetics of plot
sp.figure.set_size_inches(6.5, 4.5)                                           # this sets the size and we will use to be consistent across all plots below
sns.despine()                                                                 # remove top and right axes (w/o this, full box border)
plt.xlabel(x_label_name)                                                      # use matplotlib (plt) to adjust the x axis on this one
plt.ylabel(y_label_name)                                                      # use matplotlib (plt) to adjust the y axis on this one
plt.title(plot_title, y=1.05)                                                 # add a title so we know what we plotted!
plt.ylim(0,100)                                                               # use matplotlib (plt) to adjust the y axis limits
plt.legend(bbox_to_anchor=(1.1, 0.05), loc='lower right', title=leg_title);   # use matplotlib (plt) to adjust the legend location and label

# Save our figure
plt.savefig(fig_name_scatter1, dpi=300, bbox_inches='tight')                  # save w/figure name

Let's create a boxplot to examine if there are differences in the number of `streams` depending on `released_month` and `mode`.

In [None]:
### RUN THIS CELL ###

# Set this text for the plot- easiest to edit here as needed
x_label_name = 'Release Month'                                                # x-axis label text
y_label_name = 'Streams'                                                      # y-axis label text
leg_title = 'Mode'                                                            # legend title - column name for hue
plot_title = 'Streams of Songs Released Over Month by Mode'                   # plot title text
fig_name_box = "Boxplot_ReleaseMonth_Streams_Mode.png"                        # text to name figure for saving

# Example of a boxplot
bp = sns.boxplot(                                                             # plotting option to use in seaborn
    data=spotify,                                                             # dataframe
    x='released_month',                                                       # column name to use values for x-axis
    y='streams',                                                              # column name to use values for y-axis
    hue='mode')                                                               # color code our boxes by mode column values

# Adjustment + aesthetics of plot
bp.figure.set_size_inches(6.5, 4.5)                                           # this sets the size and we will use to be consistent across all plots below
sns.despine()                                                                 # remove top and right axes (w/o this, full box border)
plt.xlabel(x_label_name)                                                      # use matplotlib (plt) to adjust the x axis on this one
plt.ylabel(y_label_name)                                                      # use matplotlib (plt) to adjust the y axis on this one
plt.title(plot_title, y=1.1)                                                  # add a title so we know what we plotted!
plt.ylim(0,4000000000)                                                        # use matplotlib (plt) to adjust the y axis limits
plt.legend(bbox_to_anchor=(1.2, 1.05), loc='upper right', title=leg_title);   # use matplotlib (plt) to adjust the legend location and label

# Save our figure
plt.savefig(fig_name_box, dpi=300, bbox_inches='tight')                       # save w/figure name

## 💁 A Few Tips on Data Visualization

Although there are many reviews and studies discussing [best practices for figures](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003833), here are a few `seaborn` plots you can use when you wish to visualize various relationships:

**Visualizing a single variable**
- Histogram (```sns.histplot()```)
- Barplot, i.e. for counts of a single variable (```sns.countplot()```)

**Visualizing multiple numeric variables**
- Scatter plot/line graph (```sns.scatterplot()```, ```sns.lineplot()```)
- 2d histogram plot (```sns.jointplot()```)

**Visualizing categorical & numerical variable**
- Violinplot/Boxplot (```sns.violinplot()```, ```sns.boxplot()```)
- Stripplot or swarm plot (```sns.stripplot()```, ```sns.swarmplot()```)
- Barplot (```sns.barplot()```)

**Visualizing two categorical variables**
- Heatmap (```sns.heatmap()```)<br><br>


To visualize 3 variables, consider encoding one variable as the color. To visualize 4+ variables, consider using [facetgrid](https://seaborn.pydata.org/generated/seaborn.FacetGrid.html).
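As a sketch of the tip above, `sns.relplot` (a figure-level function built on `FacetGrid`) can show four variables at once: two on the axes, one as color, and one as separate panels. The data here is randomly generated and hypothetical; only the column names mirror our dataset:

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical random data with the same column names as our dataframe
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'bpm': rng.integers(60, 200, size=80),
    'energy_percent': rng.integers(0, 100, size=80),
    'mode': rng.choice(['Major', 'Minor'], size=80),
    'released_month': rng.choice([1, 6, 12], size=80),
})

g = sns.relplot(
    data=toy,
    x='bpm',                  # variable 1: x-axis
    y='energy_percent',       # variable 2: y-axis
    hue='mode',               # variable 3: color
    col='released_month',     # variable 4: one panel per month
    height=3, aspect=1)
g.set_axis_labels('Beats per minute (bpm)', 'Energy %')
```

Each panel shares the same axes, so differences between panels are easy to compare by eye.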

## 10.1 🧘 Coding Exercise

Let's look at the top 15 artists with the most songs. We will create a barplot to do this. In this example, we are going to create a smaller subset of the data and reference that instead of the entire dataframe.

>**Task:** Run the first 3 cells to get a feel for the data and then to create an initial plot. Notice that the plot is missing lots of details! Skeleton code is provided to adjust the labels and aesthetics. Adjust the provided code below to create the plot and see which artists fall within the top 15!

In [None]:
### RUN THIS CELL ###
artist_counts = spotify['artist_name'].value_counts()
artist_counts.head(50)

In [None]:
### RUN THIS CELL - THIS IS THE DATA OF THE TOP 15 TO USE ###
artist_plot = artist_counts.head(15)
artist_plot

In [None]:
### RUN THIS CELL ###
ax = sns.barplot(                                              # plotting option to use in seaborn
    x = artist_plot.index,                                     # column name to use values for x-axis
    y = artist_plot,                                           # column name to use values for y-axis
    palette = 'flare')                                         # set the color palette we want to use

Blah! 😏 That doesn't look very nice. Who are the artists? 🎨

>**Task:** In the code cell below, add all of the labels at the beginning and then adjust the aesthetics. Use the code examples for the initial plots to see what you should do.

In [None]:
### EDIT THIS CELL ###
# Set this text for the plot- easiest to edit here as needed
# My solution included 4 lines here
...

# Barplot
ax = sns.barplot(                                              # plotting option to use in seaborn
    x = artist_plot.index,                                     # column name to use values for x-axis
    y = artist_plot,                                           # column name to use values for y-axis
    hue = artist_plot,                                         # color code our bars by number of songs
    palette = 'flare')                                         # set the color palette we want to use

# Adjustment + aesthetics of plot
# My solution included 6 lines here
....
plt.xticks(rotation=45,ha='right')                             # Rotate x-axis labels for better readability

# My solution included code to save the plot
...

In [None]:
#@title Solution
# Set this text for the plot- easiest to edit here as needed
x_label_name = 'Artist Name'                                             # x-axis label text
y_label_name = 'Number of Songs'                                         # y-axis label text
plot_title = 'Top 15 Artists With Most Songs'                            # plot title text
fig_name_bar = 'Top15_artists_bar.png'                                   # text to name figure for saving

# Example of barplot
ax = sns.barplot(                                              # plotting option to use in seaborn
    x = artist_plot.index,                                     # column name to use values for x-axis
    y = artist_plot,                                           # column name to use values for y-axis
    palette = 'flare')                                         # set the color palette we want to use

# Adjustment + aesthetics of plot
ax.figure.set_size_inches(6.5, 4.5)                            # this sets the size and we will use to be consistent across all plots below
sns.despine()                                                  # remove top and right axes (w/o this, full box border)
plt.xlabel(x_label_name)                                       # use matplotlib (plt) to adjust the x axis on this one
plt.ylabel(y_label_name)                                       # use matplotlib (plt) to adjust the y axis on this one
plt.xticks(rotation=45,ha='right')                             # Rotate x-axis labels for better readability
plt.title(plot_title, y=1.1)                                   # add a title so we know what we plotted!
plt.ylim(0,40);                                                # use matplotlib (plt) to adjust the y axis limits

# Save our figure
plt.savefig(fig_name_bar, dpi=300, bbox_inches='tight')                       # save w/figure name


## 🏃 10.2 Coding Exercise

Let's look at the number of songs released per month. We will create a barplot to do this. In this example, we are going to create a smaller dataframe first.

>**Task:** Run the first cell to get the data we want to work with and then create an initial plot. Notice that the plot is missing lots of details! Skeleton code is provided to adjust the labels and aesthetics. Adjust the provided code below to create the plot and see which months have the most songs released.

In [None]:
### RUN THIS CELL ###
songs_over_time = spotify['released_month'].value_counts().reset_index()
songs_over_time.columns = ['released_month', 'count']
songs_over_time

In [None]:
### RUN THIS CELL ###
b2 = sns.barplot(                                          # plotting option to use in seaborn
    data = songs_over_time,                                # dataframe
    x = 'released_month',                                  # column name to use values for x-axis
    y = 'count',                                           # column name to use values for y-axis
    hue = 'released_month',                                # color code our bars
    palette = 'viridis')                                   # set the color palette we want to use

Again we've generated an initial plot that looks okay, but we can improve! We also need to move (remove) that legend.

>**Task:** Edit the code below to beautify the plot ✨ <br>
Hint: To remove the legend, set `legend = False` in the part where `b2` is defined within seaborn.

In [None]:
### EDIT THIS CELL ###
# My solution included 4 lines here
...


b2 = sns.barplot(                                          # plotting option to use in seaborn
    data = songs_over_time,                                # dataframe
    x = 'released_month',                                  # column name to use values for x-axis
    y = 'count',                                           # column name to use values for y-axis
    hue = 'released_month',                                # color code our bars
    palette = 'viridis')                                   # set the color palette we want to use
# My solution added a line here to remove the legend


# My solution included 6 lines here
....

# My solution included code to save the plot
...


In [None]:
#@title Example Solution
x_label_name = 'Release Month'                                             # x-axis label text
y_label_name = 'Number of Songs'                                            # y-axis label text
plot_title = 'Number of Songs by Release Month'                             # plot title text
fig_name_bar2 = 'NumSongs_ReleaseMonth.png'                                 # text to name figure for saving


b2 = sns.barplot(                                          # plotting option to use in seaborn
    data = songs_over_time,                                # dataframe
    x = 'released_month',                                  # column name to use values for x-axis
    y = 'count',                                           # column name to use values for y-axis
    hue = 'released_month',                                # color code our bars
    palette = 'viridis',                                   # set the color palette we want to use
    legend = False)                                        # remove the legend


# Adjustment + aesthetics of plot
b2.figure.set_size_inches(6.5, 4.5)                            # this sets the size and we will use to be consistent across all plots below
sns.despine()                                                  # remove top and right axes (w/o this, full box border)
plt.xlabel(x_label_name)                                       # use matplotlib (plt) to adjust the x axis on this one
plt.ylabel(y_label_name)                                       # use matplotlib (plt) to adjust the y axis on this one
plt.title(plot_title, y=1.1)                                   # add a title so we know what we plotted!
plt.ylim(0,140);                                               # use matplotlib (plt) to adjust the y axis limits

# Save our figure
plt.savefig(fig_name_bar2, dpi=300, bbox_inches='tight')       # save w/figure name


## 🏋 10.3 Coding Exercise

Now let's look at another type of plot to examine correlation values. We will use a `heatmap` to correlate the following variables:<br>
`streams`<br>
`bpm`<br>
`danceability_percent`<br>
`valence_percent`<br>
`energy_percent`<br>
`acousticness_percent`<br>
`instrumentalness_percent`<br>
`liveness_percent`<br>
`speechiness_percent`<br>

First we will create a list of the column names we want to correlate. Then we will use [`.corr()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html) to get the correlation matrix. Example call: <br>
`corr_matrix = dataframe[columns_to_corr].corr()`
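To get a feel for what a correlation matrix contains, here is a sketch on made-up numbers where the relationships are perfectly linear by construction (the values are hypothetical, chosen so the result is easy to predict):

```python
import pandas as pd

# Hypothetical values with perfectly linear relationships
toy = pd.DataFrame({
    'bpm': [90, 100, 110, 120],
    'energy_percent': [40, 50, 60, 70],         # rises as bpm rises
    'acousticness_percent': [80, 60, 40, 20],   # falls as bpm rises
})

columns_to_corr = ['bpm', 'energy_percent', 'acousticness_percent']
corr_matrix = toy[columns_to_corr].corr()   # Pearson correlation by default

print(corr_matrix.round(2))
```

Values close to 1 mean two columns rise together, values close to -1 mean one falls as the other rises, and the diagonal is always 1 because every column correlates perfectly with itself. Real data will rarely be this clean.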


Once we have these values, we can create our plot in `seaborn`.

>**Task:** Edit the code below to create our heatmap.

In [None]:
### EDIT THIS CELL ###
# Setup dataframe
columns_to_correlate = ...
corr_matrix = ...

# Title and saving name
# My solution had 2 lines here
....

# Plotting code
hm = sns.heatmap(corr_matrix,        # corr matrix as df
            annot=True,         # include values on map
            #cmap='crest',      # change the color if you want
            fmt=".2f")          # 2 decimals for displayed values

# Adjustment + aesthetics of plot (size and title only)
# My solution had 2 lines here

# Save the fig
# My solution had one line to save the fig


In [None]:
#@title Example Solution

# Setup dataframe
columns_to_correlate = ['streams', 'bpm', 'danceability_percent', 'valence_percent', 'energy_percent', 'acousticness_percent', 'instrumentalness_percent', 'liveness_percent', 'speechiness_percent']
corr_matrix = spotify[columns_to_correlate].corr()

# Labels and title
plot_title = 'Correlation Heatmap'                            # plot title text
fig_name_hmp = 'Corr_heatmap.png'                             # text to name figure for saving


# Plotting code
hm = sns.heatmap(corr_matrix,                                 # corr matrix as df
            annot=True,                                       # include values on map
            #cmap='crest',                                    # change the color if you want
            fmt=".2f")                                        # 2 decimals for displayed values

# Adjustment + aesthetics of plot (size and title only)
hm.figure.set_size_inches(10, 8)                              # larger than the earlier plots so the annotated values stay readable
plt.title(plot_title, y=1.1)                                  # add a title so we know what we plotted

# Save our figure
plt.savefig(fig_name_hmp, dpi=300, bbox_inches='tight')       # save w/figure name

## Bonus: Plotting a Relationship 🤝

Do you think `acousticness` and `energy` have a relationship with each other? If you think so, what direction would you expect (positive or negative)? If you wanted to visualize this relationship, which plot would you use?

>**Task:** Create your plot below

In [None]:
### YOUR CODE HERE ###

In [None]:
#@title Example Solution
# Set this text for the plot- easiest to edit here as needed
x_label_name = 'Acousticness %'
y_label_name = 'Energy %'
leg_title = 'Mode'
plot_title = 'Relationship between Acousticness % and Energy %'
fig_name_scatter2 = 'Scatter_Acoustic_Energy_.png'

# Example of a scatter plot
sp2 = sns.scatterplot(
    data=spotify,
    x='acousticness_percent',
    y='energy_percent')

# Adjustment + aesthetics of plot
sp2.figure.set_size_inches(6.5, 4.5)
sns.despine()
plt.xlabel(x_label_name)
plt.ylabel(y_label_name)
plt.title(plot_title, y=1.05)
plt.ylim(0,100)

# Save our figure
plt.savefig(fig_name_scatter2, dpi=300, bbox_inches='tight')  # save

---

## ⏩ 11.0 Functions


The "commands" or "instructions" we have been using have a name in Python: they are called **functions**.


A function is a block of code which only runs when it is called. The commands we were using that typically involved `()` were actually functions! Example: `np.random.choice()`

You can pass data, known as arguments or parameters, into a function. A function can return nothing at all. A function can also return data as a result.

### 🖊 Writing a function

Not only can you use functions included in the Python libraries you imported, or native to Python, you can also write your own functions.


Defining, or creating a function, always uses the following syntax.

```
def FunctionName(parameter_name1, parameter_name2, ...):
  FUNCTION BODY
  return ReturnValues
```
  

*   **```def```**: This keyword tells Python that you are about to define a function. Don't forget to also have parentheses and a colon.
*   **Function name**: The name of your function that you are going to use to call it whenever you need it to run!
*   **Function body**: The set of commands or operations you want your function to execute whenever it is called. <font color='red'> Note that the function body (which can be multiple lines) and return statement need to be indented relative to the ```def``` line!</font>
*   **Arguments** (optional): The variable names for the input values you give the function. When writing your function body you can reference the argument variable name to refer to the value of the corresponding input.
*   **```return```** and Return values (optional): The keyword `return`, followed by any data or values you want your function to return when called.

Note that you will need to run the code that defines a function before you are able to call it.
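Here is a small sketch that puts these pieces together. The function name `bpm_to_seconds_per_beat` and its `decimals` argument are made up for illustration:

```python
def bpm_to_seconds_per_beat(bpm, decimals=2):
    """Convert a tempo in beats per minute to seconds per beat."""
    seconds = 60 / bpm                 # function body (indented)
    return round(seconds, decimals)    # value handed back to the caller

# Call the function only after the definition above has been run
slow = bpm_to_seconds_per_beat(60)
fast = bpm_to_seconds_per_beat(128, decimals=3)
print(slow, fast)
```

`decimals=2` is a default value: the caller can leave that argument out, as in the first call, or override it by name, as in the second.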





### 🧗 11.1 Coding Exercise

This exercise combines what you've learned about functions, for loops and if statements.
Use a for loop to go through the rows of our dataset. On every iteration of the loop, use the function defined below called ```roll_the_dice()``` 🎲 .
- If the roll of the dice is even, store the `bpm` data for the current row in `evens`.
- If the roll of the dice is odd, store the `bpm` data for the current row in `odds`.

After the loop is complete, take the average of the evens, as well as the odds. Print the two means.


In [None]:
### RUN THIS CELL ###
def roll_the_dice():
  return np.random.choice([1,2,3,4,5,6])

In [None]:
### YOUR CODE HERE ###

In [None]:
#@title Example Solution
evens = []
odds = []

for i in range(len(spotify)):
  dice_roll = roll_the_dice()
  #print('dice_roll: ', dice_roll)
  if dice_roll % 2 == 1:                  # odd roll: 1, 3, or 5
    odds.append(spotify.bpm.iloc[i])      # .iloc selects the row by position
  else:                                   # even roll: 2, 4, or 6
    evens.append(spotify.bpm.iloc[i])

print('Average of evens:', np.average(evens).round(2))
print('Average of odds:', np.average(odds).round(2))
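The loop is great practice, but the same split can also be done without a loop by rolling all the dice at once. This is an alternative sketch, not the exercise's intended solution; the `bpm` array is a hypothetical stand-in for `spotify['bpm']`:

```python
import numpy as np

rng = np.random.default_rng(seed=42)        # seeded so the rolls repeat

# Hypothetical stand-in for spotify['bpm']
bpm = np.array([120, 95, 140, 100, 128, 110])

rolls = rng.integers(1, 7, size=len(bpm))   # one roll (1 to 6) per row
is_even = rolls % 2 == 0                    # boolean mask, one entry per row

evens = bpm[is_even]                        # rows where the roll was even
odds = bpm[~is_even]                        # rows where the roll was odd

print('rolls:', rolls)
print('evens:', evens)
print('odds: ', odds)
```

From here, `evens.mean()` and `odds.mean()` give the two averages, provided each group ended up with at least one value.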

---


## 12.0 Plotting with Matplotlib

<img src="https://upload.wikimedia.org/wikipedia/commons/d/d9/Julia_set%2C_plotted_with_Matplotlib.svg" width=400>

**Matplotlib** is a graphics library built to play nicely with numpy. Matplotlib was also conceived as an alternative to MATLAB's plotting tools, so those familiar with MATLAB may notice some syntax overlap. Nowadays, there are certainly more advanced plotting libraries than Matplotlib with various beautifying features, but Matplotlib remains popular because of its simplicity and ability to work with nearly every graphics engine.

Recall that previously, we imported `matplotlib.pyplot` as `plt` (See "[Imports & Libraries](https://colab.research.google.com/drive/1mJXrW_IvUCnsQh_vqzjJalsUuF9mJww7#scrollTo=Si-MZ10nODDZ)"). Optionally, we can also set a default style for our matplotlib plots. Some available styles are "ggplot", "classic", "fivethirtyeight", and "tableau-colorblind10". The cell below uses the "fivethirtyeight" style, but feel free to experiment with different styles by editing and running the following code.

In [None]:
# Run this to set the default matplotlib style (optional)
plt.style.use('fivethirtyeight')

# Example plot
plt.figure(figsize=(3,3))
plt.plot([1,2,3,4], [2,3,4,5], '-x')
plt.show()

### 🖌 Plotting, Rendering, and Saving Figures

The cleanest way to save a figure is:

(1) Initialize a figure object. If you are combining multiple plots onto one figure object you may want to initialize "ax" (axis) objects for each plot as well, so that you can have more control over the settings of each graph later.

(2) Plot your data on your figure. Matplotlib lets you do this directly from the pyplot module (```plt.plot(x,y)```), or you can apply the plot function to the specific axes (subplot) you wish to graph to. (```ax[0].plot(x,y)```).

(3) Save your figure to a filename.

(4) Use ```plt.show()``` to render your figure.


This strategy should work in colab, jupyter notebook, or within a Python script.

<br>

Here are two different example workflows for creating images. Feel free to look in your Colab files for the saved figures.

In [None]:
### EXAMPLE 1 ####
print("Example 1...")
x = np.array([1,2,3,4,5,6])
y = np.array([1,4,5,6,7,19])

# Initialize figure
# Note that we can set a figsize parameter
# to specify the window size.
fig = plt.figure(figsize=(4,3))

# Simple scatter plot
plt.plot(x,y, 'x')
plt.xlabel('x')
plt.ylabel('y')

# Save figure
plt.savefig('example1.png')

# Render figure in colab window
plt.show()

print('Example 2...')

### EXAMPLE 2 ###

# Initialize figure
# This time we will initialize a figure with subplots,
# consisting of 3 graphs, each next to the other.
fig, ax = plt.subplots(figsize=(12,3), ncols=3)

# Two scatter plots (different markers) and one line plot.
ax[0].plot(x,y, 'x')
ax[1].plot(x,y, '-')
ax[2].plot(x,y, 'o')


for i in range(3):
  ax[i].set_xlabel('x')
  ax[i].set_ylabel('y')

# Save figure
plt.savefig('example2.png')

# Render figure in colab window
plt.show()

### 🌈 Multiple Lines on a graph

There are many cases where we may want to show 3 or more dimensions on a graph, using color or shape as the third dimension. There are two strategies for doing this.

(1) Plot each line/distribution on the graph at a time, specifying the label and color for each line

or

(2) Plot a multidimensional numpy array, specifying which axis should be the x, y, color, or shape axes.


Here is an example that builds a histogram of the cost of different candies in 50 states (randomly generated).

In [None]:
sour_patch_kids_cost = np.random.rand(50)+.5
hershey_bar_cost = np.random.rand(50)+.25
reeses_cost = np.random.rand(50)+1


# We will use 10 cent increments from $0 to $2.90
# for our histogram buckets (optional argument).
cost_bins = [i*.1 for i in range(30)]


##### Approach 1 ######
f = plt.figure(figsize=(5,4))

# The alpha parameter refers to how opaque each color is.
print("Example 1...")
plt.hist(sour_patch_kids_cost, color='green', label='sour patch', alpha=.7, bins=cost_bins, histtype='stepfilled')
plt.hist(hershey_bar_cost, color='brown', label='hershey', alpha=.7, bins=cost_bins, histtype='stepfilled')
plt.hist(reeses_cost, color='orange', label='reeses', alpha=.7, bins=cost_bins, histtype='stepfilled')
plt.legend()
plt.xlabel('cost per bar')
plt.ylabel('# of stores')
plt.show()


##### Approach 2 ####
print("Example 2...")
f = plt.figure(figsize=(5,4))
all_candy_bars = np.array([sour_patch_kids_cost,
                           hershey_bar_cost,
                           reeses_cost]).transpose()
plt.hist(all_candy_bars, alpha=.8,
         color=['green', 'brown', 'orange'],
         label=['sour patch', 'hershey', 'reeses'],
         histtype='stepfilled', bins=cost_bins)
plt.xlabel('cost per bar')
plt.ylabel('# of stores')
plt.legend()
plt.show()


### 💄 Beautifying your plots


A complaint many have about Matplotlib is that the plots are not very aesthetically appealing by default. However, Matplotlib offers numerous settings that give you precise control over what your plots look like. Here are a few examples of what you can control and how to go about changing these settings.

Let's start with the following plots, and experiment with changing some of the parameters.


**Colors and Shapes**

Typically, we pass in arguments for what we want the colors and shapes of a plot to look like, when we call the plotting function (```.plot()```, ```.hist()```, etc.).  You can also set a specific color palette for Matplotlib to select colors from.
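One way to set a palette is to change the "color cycle" of a subplot, so that each new line picks up the next color automatically. This is a sketch with three arbitrarily chosen hex colors; the data is made up:

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 1, 20)

fig, ax = plt.subplots(figsize=(4, 3))

# Successive lines on this subplot use these colors, in this order
ax.set_prop_cycle(color=['#4B8BBE', '#FFD43B', '#a032a8'])

for slope in [1, 2, 3]:
    ax.plot(x, slope * x, label=f'slope {slope}')   # no color argument needed

ax.legend()
```

Passing `color=` directly to a plotting call still overrides the cycle for that one line, so the two approaches can be mixed.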



In [None]:
f, ax = plt.subplots(ncols=2, nrows=2, figsize=(10,10))

# Some random distributions for variables.
x = np.random.random(500)
y = x + .2*np.random.randn(500)
i = np.random.randn(500)
j = np.random.randn(500)

# Plotting different types of plots.
# Sometimes if you want to edit a figure,
# it can be helpful to return an object from the plotting function
# and then you can mess around with properties of that object.
# (See plt3 for an example)
plt1 = ax[0,0].plot(x,y,
             marker='x', linestyle='none',
             markersize=15, color='red', label='dist1')
plt11 = ax[0,0].plot(x,i,
             marker='.', linestyle='none',
             markersize=8, color='blue', label='dist2')

# We can also specify our colors in hexcode.
plt2 = ax[0,1].hist(x, color='#a032a8')


# patch_artist=True lets us color in the boxes.
plt3 = ax[1,0].boxplot([x,y,i,j], patch_artist=True)
# We can reference the object output by our plotting function
# in order to change various parameters.
plt3['boxes'][0].set_color('red')
plt3['boxes'][1].set_color('orange')
plt3['boxes'][2].set_color('yellow')
plt3['boxes'][3].set_color('pink')

# See the Matplotlib documentation for the available color maps.
plt4 = ax[1,1].hist2d(i,j, bins=20, cmap='spring')

plt.show()

### Plot Background

We can also edit the backgrounds of the plots - this might mean adding/removing gridlines, changing the background color, or even adding additional text or figures to the plots.  Here are some examples for how we might edit the background attributes of our figures.

In [None]:
# We can turn on/off legends, colorbars, etc.
ax[0,0].legend()

# We can add text, and shapes to plots.
ax[1,0].text(s='this is \na boxplot', x=1, y=-2, fontsize=20)


# We can turn gridlines on / off.
ax[1,1].grid(visible=True, axis='both', color='black')

# We can set the background color.
ax[0,1].set_facecolor('grey')

# Show updated figure.
f


### Axes

We can also edit the x- and y-axes, ticks, and labels.

(Note: in Matplotlib, subplots are unfortunately called "axes". These are **different** from the x- and y-axes that each subplot contains.)

Here are a few examples for how you might want to modify your x and y axes.


In [None]:
# We can set the labels for our axes.
ax[0,0].set_xlabel('x', fontsize=20)
ax[0,0].set_ylabel('y', fontsize=20)

# We can remove ticks, or change the scale of the ticks.
ax[0,1].set_xticks([])
ax[0,1].set_yscale('log')

# We can set different tick labels.
ax[1,0].set_xticklabels(['x', 'y', 'i', 'j'], fontsize=14)

# We can rotate the labels,
# while keeping the tick labels the same.
ax[1,1].set_xticklabels(ax[1,1].get_xticks(), rotation=45)

# Show updated figure.
f

---

## 13.0 Reading and Writing Files

### ✍️ Writing to a File

When you write to a file, you can either create a new file or completely overwrite an existing one. This is done using the `'w'` mode.

The best way to work with files is using the `with` statement. It automatically handles closing the file for you, even if errors occur.

After running this code, a file named `my_file.txt` will be created with two lines of text. The `\n` is a special character that represents a newline.

In [None]:
# The 'w' opens the file in write mode. If the file doesn't exist, it's created.
# If it does exist, its contents are erased.
with open('my_file.txt', 'w') as f:
    f.write("Hello, Python!\n")
    f.write("This is the second line.\n")
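
To make the claim about `with` concrete, here is a small sketch that raises an error on purpose inside the block and then checks that the file was still closed. The file name `with_demo.txt` and the temporary directory are assumptions made only for this demonstration.

```python
import os
import tempfile

# Scratch file in a temporary directory (hypothetical name, for illustration).
demo_path = os.path.join(tempfile.mkdtemp(), "with_demo.txt")

try:
    with open(demo_path, "w") as f:
        f.write("written before the error\n")
        raise ValueError("simulated failure inside the block")
except ValueError as err:
    print("Caught:", err)

# The with statement closed the file even though an error occurred.
print("File closed?", f.closed)

# The text written before the error was still saved to disk.
with open(demo_path) as f:
    saved_text = f.read()
print(repr(saved_text))
```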

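You can also use `writelines()` to write a list of strings to a file. It does not add newline characters for you, so each string needs its own `\n`.

In [None]:
my_text = ["Hello, Python!", "This is the second line."]

# 'w' mode again: if the file already exists, its contents are erased.
with open('my_file.txt', 'w') as f:
    # writelines() writes each string as-is, so we add the newline ourselves.
    f.writelines(line + "\n" for line in my_text)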

### Appending to a File

If you want to add content to the end of an existing file without deleting its current content, open it in append mode using `'a'`.

In [None]:
# The 'a' opens the file in append mode.
# New content is added to the end of the file.
with open('my_file.txt', 'a') as f:
    f.write("This line was appended.\n")
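
The difference between the two modes is easiest to see side by side. The sketch below uses a scratch file (`mode_demo.txt`, a name chosen only for this example) and writes to it twice in each mode.

```python
import os
import tempfile

mode_demo = os.path.join(tempfile.mkdtemp(), "mode_demo.txt")

# Two writes in 'w' mode: the second one erases the first.
for text in ["first\n", "second\n"]:
    with open(mode_demo, "w") as f:
        f.write(text)
with open(mode_demo) as f:
    after_w = f.read()

# Two writes in 'a' mode: both are added after the existing content.
for text in ["third\n", "fourth\n"]:
    with open(mode_demo, "a") as f:
        f.write(text)
with open(mode_demo) as f:
    after_a = f.read()

print(repr(after_w))
print(repr(after_a))
```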

### Reading the Entire File into Memory

The `read()` method is simple and useful for small files. It reads the entire file content into a single string variable. Be careful, as this can consume a lot of memory if the file is very large!

In [None]:
# The 'r' opens the file in read mode.
with open('my_file.txt', 'r') as f:
    content = f.read()
    print(content)

### Reading Line by Line (Memory Efficient)

This is the recommended way to read files, especially large ones. It reads the file one line at a time, which is very memory-efficient. You can process each line as it's read using a for loop.

In [None]:
with open('my_file.txt', 'r') as f:
    for line in f:
        # The .strip() method removes leading/trailing whitespace, including the newline character.
        print(line.strip())
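
Because each line is handled as soon as it is read, you can keep running totals without ever storing the whole file. The sketch below assumes a made-up file of 1,000 numbers (`numbers.txt`) and computes a line count and a sum in a single pass.

```python
import os
import tempfile

# Create a demonstration file with one number per line (hypothetical data).
big_path = os.path.join(tempfile.mkdtemp(), "numbers.txt")
with open(big_path, "w") as f:
    for i in range(1, 1001):
        f.write(f"{i}\n")

line_count = 0
running_total = 0
with open(big_path) as f:
    for line in f:  # only the current line is held in memory
        line_count += 1
        running_total += int(line.strip())

print(line_count, running_total)
```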

### Reading All Lines into a List

You can also read all the lines of a file into a list of strings with `readlines()`. Each item in the list corresponds to a line in the file, including its trailing newline character. Like `read()`, this method loads the entire file into memory, so it's best for smaller files.

In [None]:
with open('my_file.txt', 'r') as f:
    lines_list = f.readlines()
    print(lines_list)
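
One detail worth seeing in action: `readlines()` keeps the `\n` at the end of each line. The sketch below compares it with `read().splitlines()`, which drops the newlines. The file name and its three lines are invented for the example.

```python
import os
import tempfile

lines_demo = os.path.join(tempfile.mkdtemp(), "lines_demo.txt")
with open(lines_demo, "w") as f:
    f.write("alpha\nbeta\ngamma\n")

with open(lines_demo) as f:
    raw_lines = f.readlines()            # newline characters are kept

with open(lines_demo) as f:
    clean_lines = f.read().splitlines()  # newline characters are removed

# Stripping the newline from each raw line gives the same result.
stripped = [line.rstrip("\n") for line in raw_lines]

print(raw_lines)
print(clean_lines)
```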


### Using `csv` package for reading and writing

In [None]:
import csv

# write
rows = [["id","value"], ["A",10], ["B",20]]
with open("table.csv", "w", newline="", encoding="utf-8") as f:
    w = csv.writer(f)
    w.writerows(rows)

# read
with open("table.csv", "r", newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        print(row)
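
When a CSV file has a header row, `csv.DictReader` can be a convenient alternative: it returns each row as a dictionary keyed by the column names. The sketch below reuses the same small table but writes it to a scratch file (`table_demo.csv`, an assumed name). Note that every value comes back as a string, so numbers need converting.

```python
import csv
import os
import tempfile

csv_path = os.path.join(tempfile.mkdtemp(), "table_demo.csv")
rows = [["id", "value"], ["A", 10], ["B", 20]]
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

with open(csv_path, "r", newline="", encoding="utf-8") as f:
    # The first row is used as the dictionary keys.
    records = list(csv.DictReader(f))

print(records)

# Values are read back as text, so convert before doing arithmetic.
total = sum(int(rec["value"]) for rec in records)
print("Total:", total)
```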

### Using `json` package for reading and writing

The `json` package is a built-in Python library that allows you to work with JSON (JavaScript Object Notation) data. JSON is a lightweight, text-based format that is easy for humans to read and for machines to parse. It's commonly used for storing configuration data and for exchanging information between a web server and a client.

The `json` package primarily helps you convert Python objects (like dictionaries and lists) into JSON formatted strings, and vice-versa.

To write a Python object (like a dictionary) to a file in the JSON format, you use the `json.dump()` function. This function takes the Python data and the file object you want to write to.

A very useful argument is `indent`, which formats the JSON file with indentation, making it much more readable.

In [None]:
import json

user_data = {
    "name": "Alex",
    "id": 12345,
    "is_active": True,
    "courses": ["History", "Math", "Science"]
}

# 'w' opens the file in write mode
with open('user_data.json', 'w') as f:
    # json.dump() writes the dictionary to the file in JSON format
    # indent=4 makes the file human-readable
    json.dump(user_data, f, indent=4)


To read a JSON file and convert it back into a Python object, you use the `json.load()` function. This function takes the file object as an argument and returns a Python dictionary or list.

In [None]:
import json

# 'r' opens the file in read mode
with open('user_data.json', 'r') as f:
    # json.load() reads the file and parses the JSON data into a Python object
    data = json.load(f)

# Now 'data' is a Python dictionary
print(f"User's name: {data['name']}")
print(f"Courses: {data['courses']}")
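
The same conversion also works without a file: `json.dumps()` returns a JSON string and `json.loads()` parses one. The short sketch below uses a made-up `record` dictionary to show how a few Python values are translated along the way.

```python
import json

# Illustrative record; the values are invented for this example.
record = {"name": "Alex", "is_active": True, "nickname": None, "scores": (90, 85)}

as_text = json.dumps(record)   # Python object -> JSON string
print(as_text)

restored = json.loads(as_text)  # JSON string -> Python object
print(restored)
```

In the JSON text, `True` becomes `true` and `None` becomes `null`. After the round trip the tuple comes back as a list, because JSON only has arrays.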


### 13.1 Coding Exercise 1: Create and Read a To-Do List 📝

Write a Python script that creates a file named todo.txt. Write three tasks to the file, each on a new line. After writing, read the entire file and print its contents to the console.

In [None]:
### YOUR CODE HERE ###

In [None]:
#@title Example Solution
tasks = ["Finish homework\n", "Go grocery shopping\n", "Call mom\n"]

with open('todo.txt', 'w') as f:
    f.writelines(tasks)

with open('todo.txt', 'r') as f:
    print(f.read())

### 13.2 Coding Exercise 2: Update and Count Log Entries 🔢

Imagine you have a file named `app.log` that already contains some entries. Write a script that appends a new entry, `"Application restarted."`, to the end of the file. Then, read the file and print the total number of log entries (lines).

In [None]:
# Setup: create a starting log file so there is something to append to.
with open('app.log', 'w') as f:
    f.write("System initialized.\n")
    f.write("User authenticated.\n")

### YOUR CODE HERE ###


In [None]:
#@title Example Solution
# First, create a dummy log file for the exercise
with open('app.log', 'w') as f:
    f.write("System initialized.\n")
    f.write("User authenticated.\n")

# Now, append and count the lines
with open('app.log', 'a') as f:
    f.write("Application restarted.\n")

with open('app.log', 'r') as f:
    print(f"Total log entries: {len(f.readlines())}")

### 13.3 Coding Exercise 3: JSON Weather Data

We are going to download weather data from `wttr.in` which provides weather data in `json` format. After downloading, let's open the file up and look at the structure.

In [None]:
!curl "wttr.in/Reno+NV?format=j1" > reno.json

Let's open the file and look at the possible fields.

We are going to write a human-readable summary output from this input JSON file.
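
One way to get an overview of an unfamiliar JSON file is to print its keys level by level. The helper below, `outline()`, is a hypothetical sketch rather than part of the `json` package. The `sample` dictionary uses invented values and only a few of the key names that this notebook reads from `reno.json`; a real download contains many more fields.

```python
def outline(obj, depth=0, max_depth=2):
    """Return indented 'key: type' lines describing nested dicts and lists."""
    lines = []
    pad = "  " * depth
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines.append(f"{pad}{key}: {type(value).__name__}")
            if depth < max_depth:
                lines.extend(outline(value, depth + 1, max_depth))
    elif isinstance(obj, list) and obj:
        # Lists usually hold similar items, so describe only the first one.
        lines.extend(outline(obj[0], depth, max_depth))
    return lines

# Tiny stand-in with invented values (not real weather data).
sample = {
    "current_condition": [{"temp_F": "70", "humidity": "20"}],
    "nearest_area": [{"areaName": [{"value": "Reno"}]}],
}

structure = outline(sample)
print("\n".join(structure))
```

With the downloaded file, you could call `outline(data)` after loading it with `json.load()`.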

In [None]:
import json
import datetime

# Load the JSON data
with open('reno.json', 'r') as f:
    data = json.load(f)

def get_day_with_suffix(d):
    return str(d) + ("th" if 11 <= d <= 13 else {1: "st", 2: "nd", 3: "rd"}.get(d % 10, "th"))

now = datetime.datetime.now()

# Parse the city name from the JSON
city_name = data['nearest_area'][0]['areaName'][0]['value']

# Calculate all date/time values
current_day = now.strftime("%A")
current_time = now.strftime("%I:%M %p") # e.g., 04:49 PM
month_name = now.strftime("%B") # e.g., September
day_number = get_day_with_suffix(now.day) # e.g., 19th

# Extract the specific data points into variables for easier access
current = data['current_condition'][0]
weather = data['weather'][0]
astronomy = weather['astronomy'][0]

# Build the final report using an f-string
report = f"""
### {city_name} Weather Report

Good afternoon, {city_name}! It's {current_time} on a beautiful {current_day}, {month_name} {day_number}.



Right now, we're enjoying conditions under {current['weatherDesc'][0]['value'].lower()} skies.
The current temperature is {current['temp_F']}°F, though with the sun out it feels a little warmer at {current['FeelsLikeF']}°F.
The humidity is sitting at a dry {current['humidity']}%,  with precipitation at {current['precipInches']} inches.
The UV index is currently a moderate {current['uvIndex']}.

Looking at the full picture for today, we'll see a high of {weather['maxtempF']}°F and a low tonight of {weather['mintempF']}°F.
The sun will be setting this evening at {astronomy['sunset']}.

For you stargazers, the moon is currently in its {astronomy['moon_phase']} phase and will rise early tomorrow morning at {astronomy['moonrise']}.

Enjoy the rest of your {current_day}!
"""

# Print the final, formatted report
print(report)

---

## 14.0 Writing our own Parser

We will be writing our own data parser. A parser is a program that reads a file in a specific format and stores the important pieces in a data structure (like a `dict`), so that the rest of the program can summarize or work with the data.

We will be working with a typical annotation format within Bioinformatics called a GFF3 file.  Specifications about this format can be found <a href="https://useast.ensembl.org/info/website/upload/gff3.html">here</a>, but I'll summarize below.

- 9 tab-separated columns
  - `seqid source type start end score strand phase attributes`
  - Attributes are `key=value` pairs separated by `;`, e.g. `ID=gene1;Name=Foo;Parent=trans1`.
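
As a quick illustration of the nine columns, the sketch below splits a single GFF3 line on tabs and pairs each value with its column name. The list `GFF3_COLUMNS` is a name introduced here for convenience.

```python
GFF3_COLUMNS = ["seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes"]

# One data line in GFF3 format (\t is a tab character).
line = "chr1\tdemo\tgene\t1000\t1800\t.\t+\t.\tID=g1;Name=Gene1"

fields = line.split("\t")
record = dict(zip(GFF3_COLUMNS, fields))

for name, value in record.items():
    print(f"{name:>10}: {value}")
```

Every value is still a string at this point; `start` and `end` need `int()` before you can do arithmetic with them.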

Let's create a minimal example before we work with a larger file.

In [None]:
sample_gff = """##gff-version 3
chr1\tdemo\tgene\t1000\t1800\t.\t+\t.\tID=g1;Name=Gene1
chr1\tdemo\tmRNA\t1000\t1800\t.\t+\t.\tID=g1.t1;Parent=g1
chr1\tdemo\texon\t1000\t1200\t.\t+\t.\tID=g1.t1.ex1;Parent=g1.t1
chr1\tdemo\texon\t1500\t1800\t.\t+\t.\tID=g1.t1.ex2;Parent=g1.t1

chr1\tdemo\tgene\t3000\t4200\t.\t-\t.\tID=g2;Name=Gene2
chr1\tdemo\tmRNA\t3000\t4200\t.\t-\t.\tID=g2.t1;Parent=g2
chr1\tdemo\texon\t3000\t3200\t.\t-\t.\tID=g2.t1.ex1;Parent=g2.t1
chr1\tdemo\texon\t3400\t3600\t.\t-\t.\tID=g2.t1.ex2;Parent=g2.t1
chr1\tdemo\texon\t4000\t4200\t.\t-\t.\tID=g2.t1.ex3;Parent=g2.t1

chr2\tdemo\tgene\t500\t1300\t.\t+\t.\tID=g3;Name=Gene3
chr2\tdemo\tmRNA\t500\t1300\t.\t+\t.\tID=g3.t1;Parent=g3
chr2\tdemo\texon\t500\t800\t.\t+\t.\tID=g3.t1.ex1;Parent=g3.t1
chr2\tdemo\texon\t1000\t1300\t.\t+\t.\tID=g3.t1.ex2;Parent=g3.t1
"""

with open("mini.gff3", "w") as f:
    f.write(sample_gff)

print("Wrote mini.gff3")

### Plan the data structures

We’ll collect:

- `feature_counts`: counts by type (`gene`, `mRNA`, `exon`, …)
- `genes`: `gene_id -> {seqid, start, end, strand}`
- `transcripts`: `mrna_id -> {gene_id, seqid, start, end, strand}`
- `exons_by_tx`: `mrna_id -> list of (start, end)`
- `genes_by_seqid`: `seqid -> list of gene_ids` (for per-contig stats)
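
To picture what these structures hold, the sketch below fills them in by hand for a single gene with one transcript and two exons. A parser would do the same thing automatically, one line at a time.

```python
from collections import Counter, defaultdict

feature_counts = Counter()
genes = {}
transcripts = {}
exons_by_tx = defaultdict(list)     # missing keys start as an empty list
genes_by_seqid = defaultdict(list)

# Hand-entered records for one gene (g1) on chr1.
feature_counts.update(["gene", "mRNA", "exon", "exon"])
genes["g1"] = {"seqid": "chr1", "start": 1000, "end": 1800, "strand": "+"}
genes_by_seqid["chr1"].append("g1")
transcripts["g1.t1"] = {"gene_id": "g1", "seqid": "chr1",
                        "start": 1000, "end": 1800, "strand": "+"}
exons_by_tx["g1.t1"].append((1000, 1200))
exons_by_tx["g1.t1"].append((1500, 1800))

print(dict(feature_counts))
print(dict(exons_by_tx))
print(dict(genes_by_seqid))
```

`Counter` and `defaultdict` save us from checking whether a key already exists before counting or appending.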

### Write a helper function to parse the attributes column

In [None]:
def parse_attrs(attr_field: str) -> dict:
    """
    Parse a GFF3 attributes string like 'ID=g1;Parent=g0;Name=Foo'
    into a dict. Handles empty fields '.'.
    """
    out = {}
    attr_field = attr_field.strip()
    if attr_field == "." or not attr_field:
        return out
    for chunk in attr_field.split(";"):
        if not chunk:
            continue
        if "=" in chunk:
            k, v = chunk.split("=", 1)
            out[k] = v
        else:
            # Some files have flag-like attributes without '='
            out[chunk] = True
    return out

# quick test
parse_attrs("ID=g1;Parent=g0;Note=hello")
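
Real GFF3 files add two wrinkles that this helper leaves alone. An attribute such as `Parent` may list several values separated by commas, and special characters inside values are percent-encoded (for example `%2C` for a comma). The function below, `parse_attrs_multi`, is a hypothetical variant sketched under those assumptions; it is not used in the rest of this notebook, and the example attribute string is invented.

```python
from urllib.parse import unquote

def parse_attrs_multi(attr_field):
    """Hypothetical variant: every value becomes a list of decoded strings."""
    out = {}
    attr_field = attr_field.strip()
    if attr_field in ("", "."):
        return out
    for chunk in attr_field.split(";"):
        if "=" not in chunk:
            continue  # this sketch ignores empty chunks and bare flags
        key, value = chunk.split("=", 1)
        # Split on commas first, then decode, so an encoded comma (%2C)
        # stays inside its value instead of splitting it.
        out[unquote(key)] = [unquote(v) for v in value.split(",")]
    return out

example = parse_attrs_multi("ID=ex1;Parent=t1,t2;Note=kinase%2C putative")
print(example)
```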

### Read the GFF3 file and store data

In [None]:
from collections import defaultdict, Counter

feature_counts = Counter()
genes         = {}                          # gene_id -> dict
transcripts   = {}                          # mrna_id -> dict
exons_by_tx   = defaultdict(list)           # mrna_id -> list[(start,end)]
genes_by_seqid = defaultdict(list)          # seqid -> [gene_ids]

with open("mini.gff3") as fh:
    for line in fh:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip comments

        parts = line.split("\t")
        if len(parts) != 9:
            # not a valid GFF3 data line; skip or raise
            continue

        seqid, source, ftype, start, end, score, strand, phase, attrs = parts
        start, end = int(start), int(end)  # GFF3 uses 1-based inclusive coords
        a = parse_attrs(attrs)

        feature_counts[ftype] += 1

        if ftype == "gene":
            gid = a.get("ID")
            if gid:
                genes[gid] = {
                    "seqid": seqid, "start": start, "end": end, "strand": strand
                }
                genes_by_seqid[seqid].append(gid)

        elif ftype in ("mRNA","transcript"):
            tid = a.get("ID")
            parent_gene = a.get("Parent")
            if tid and parent_gene:
                transcripts[tid] = {
                    "gene_id": parent_gene, "seqid": seqid,
                    "start": start, "end": end, "strand": strand
                }

        elif ftype == "exon":
            parent_tx = a.get("Parent")
            if parent_tx:
                exons_by_tx[parent_tx].append((start, end))

# quick peek
feature_counts, list(genes.keys()), list(transcripts.keys())


### Calculate Summary Statistics

In [None]:
print("Feature counts:", dict(feature_counts))
print("Genes:", len(genes))
print("Transcripts:", len(transcripts))
print("Total exons:", sum(len(v) for v in exons_by_tx.values()))


In [None]:
#GFF3 coordinates are inclusive; length = end - start + 1
gene_lengths = {gid: (g["end"] - g["start"] + 1) for gid, g in genes.items()}

n = len(gene_lengths)
avg = sum(gene_lengths.values()) / n
mn_gid = min(gene_lengths, key=gene_lengths.get)
mx_gid = max(gene_lengths, key=gene_lengths.get)

print(f"Average gene length: {avg:.1f} bp")
print(f"Shortest gene: {mn_gid} ({gene_lengths[mn_gid]} bp)")
print(f"Longest gene:  {mx_gid} ({gene_lengths[mx_gid]} bp)")
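
The `+ 1` in the length formula is easy to forget, so here is the arithmetic worked through for gene `g1` from `mini.gff3`. The helper `feature_length` is a name introduced for this sketch. Summing the exon lengths is an extra step that the summary above does not compute.

```python
def feature_length(start, end):
    """Length in bp for 1-based, inclusive coordinates."""
    return end - start + 1

# Gene g1 spans 1000..1800, with exons 1000..1200 and 1500..1800.
gene_len = feature_length(1000, 1800)
exons = [(1000, 1200), (1500, 1800)]
exon_lens = [feature_length(s, e) for s, e in exons]
spliced_len = sum(exon_lens)  # exon bases only, introns excluded

print(gene_len, exon_lens, spliced_len)
```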


In [None]:
# Genes per contig and per strand
print("Genes per contig:")
for seqid, gids in genes_by_seqid.items():
    print(f"  {seqid}: {len(gids)}")

strand_counts = Counter( g["strand"] for g in genes.values() )
print("Genes by strand:", dict(strand_counts))


In [None]:
# Exons per transcript and per gene
exons_per_tx = {tid: len(exons_by_tx.get(tid, [])) for tid in transcripts}
avg_exons = sum(exons_per_tx.values()) / max(1, len(exons_per_tx))
print("Exons per transcript:", exons_per_tx)
print(f"Average exons/transcript: {avg_exons:.2f}")

# If you had multiple transcripts per gene, aggregate:
exons_per_gene = defaultdict(int)
for tid, exs in exons_by_tx.items():
    gid = transcripts.get(tid, {}).get("gene_id")
    if gid:
        exons_per_gene[gid] += len(exs)

print("Exons per gene:", dict(exons_per_gene))


### Make this into a program so that it can be run with different input files

Copy the following code and create a new file called `parse_gff3.py` in a text editor like Notepad or Sublime Text. Then upload this file to Google Colab.

In [None]:
from collections import defaultdict, Counter
import argparse

feature_counts = Counter()
genes         = {}                          # gene_id -> dict
transcripts   = {}                          # mrna_id -> dict
exons_by_tx   = defaultdict(list)           # mrna_id -> list[(start,end)]
genes_by_seqid = defaultdict(list)          # seqid -> [gene_ids]

def parse_attrs(attr_field: str) -> dict:
    """
    Parse a GFF3 attributes string like 'ID=g1;Parent=g0;Name=Foo'
    into a dict. Handles empty fields '.'.
    """
    out = {}
    attr_field = attr_field.strip()
    if attr_field == "." or not attr_field:
        return out
    for chunk in attr_field.split(";"):
        if not chunk:
            continue
        if "=" in chunk:
            k, v = chunk.split("=", 1)
            out[k] = v
        else:
            out[chunk] = True
    return out

# --- argparse: get the input file path ---
parser = argparse.ArgumentParser(description="Summarize a GFF3 file (simple parser).")
parser.add_argument("gff3", help="Path to input GFF3 file")
args = parser.parse_args()

with open(args.gff3, "r") as fh:
    for line in fh:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip comments

        parts = line.split("\t")
        if len(parts) != 9:
            # not a valid GFF3 data line; skip or raise
            continue

        seqid, source, ftype, start, end, score, strand, phase, attrs = parts
        start, end = int(start), int(end)  # GFF3 uses 1-based inclusive coords
        a = parse_attrs(attrs)

        feature_counts[ftype] += 1

        if ftype == "gene":
            gid = a.get("ID")
            if gid:
                genes[gid] = {
                    "seqid": seqid, "start": start, "end": end, "strand": strand
                }
                genes_by_seqid[seqid].append(gid)

        elif ftype in ("mRNA","transcript"):
            tid = a.get("ID")
            parent_gene = a.get("Parent")
            if tid and parent_gene:
                transcripts[tid] = {
                    "gene_id": parent_gene, "seqid": seqid,
                    "start": start, "end": end, "strand": strand
                }

        elif ftype == "exon":
            parent_tx = a.get("Parent")
            if parent_tx:
                exons_by_tx[parent_tx].append((start, end))

print("Feature counts:", dict(feature_counts))
print("Genes:", len(genes))
print("Transcripts:", len(transcripts))
print("Total exons:", sum(len(v) for v in exons_by_tx.values()))

#GFF3 coordinates are inclusive; length = end - start + 1
gene_lengths = {gid: (g["end"] - g["start"] + 1) for gid, g in genes.items()}

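n = len(gene_lengths)
if n == 0:
    # Avoid dividing by zero (and calling min()/max() on an empty dict)
    # when the input file contains no gene features.
    print("No gene features found.")
else:
    avg = sum(gene_lengths.values()) / n
    mn_gid = min(gene_lengths, key=gene_lengths.get)
    mx_gid = max(gene_lengths, key=gene_lengths.get)

    print(f"Average gene length: {avg:.1f} bp")
    print(f"Shortest gene: {mn_gid} ({gene_lengths[mn_gid]} bp)")
    print(f"Longest gene:  {mx_gid} ({gene_lengths[mx_gid]} bp)")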

# Genes per contig and per strand
print("Genes per contig:")
for seqid, gids in genes_by_seqid.items():
    print(f"  {seqid}: {len(gids)}")

strand_counts = Counter( g["strand"] for g in genes.values() )
print("Genes by strand:", dict(strand_counts))


# Exons per transcript and per gene
exons_per_tx = {tid: len(exons_by_tx.get(tid, [])) for tid in transcripts}
avg_exons = sum(exons_per_tx.values()) / max(1, len(exons_per_tx))
print("Exons per transcript:", exons_per_tx)
print(f"Average exons/transcript: {avg_exons:.2f}")

# If you had multiple transcripts per gene, aggregate:
exons_per_gene = defaultdict(int)
for tid, exs in exons_by_tx.items():
    gid = transcripts.get(tid, {}).get("gene_id")
    if gid:
        exons_per_gene[gid] += len(exs)

print("Exons per gene:", dict(exons_per_gene))


Now we will test running our Python script on our `mini.gff3` file from the command line. Notice that the input file is *not* hard-coded into the analysis script, so this will work with any filename you provide.

In [None]:
!python parse_gff3.py mini.gff3
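
If you want to see how `argparse` picks up the filename without leaving the notebook, you can pass the arguments as a list. This sketch rebuilds the same parser as the script and simulates the command `python parse_gff3.py mini.gff3`.

```python
import argparse

parser = argparse.ArgumentParser(description="Summarize a GFF3 file (simple parser).")
parser.add_argument("gff3", help="Path to input GFF3 file")

# On the command line, parse_args() reads the arguments typed after the
# script name. Passing a list here simulates that.
args = parser.parse_args(["mini.gff3"])

print(args.gff3)
```

The positional argument is stored under the name given to `add_argument()`, which is why the script opens `args.gff3`.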

Let's download another GFF3 file and test our script on it. We will download the *E. coli* GFF3 file, uncompress it, and then run our Python program.

In [None]:
!wget --no-check-certificate https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.gff.gz
!gunzip GCF_000005845.2_ASM584v2_genomic.gff.gz

In [None]:
!python parse_gff3.py GCF_000005845.2_ASM584v2_genomic.gff


---



# Extras


## 🔊 Broadcasting

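Broadcasting lets NumPy perform element-by-element computations on arrays with different but compatible shapes. Shapes are compared starting from the last dimension, and two dimensions are compatible when they are equal or when one of them is 1. Perhaps you want to multiply or divide every row of array A element-by-element by array B. Here is a plausible real-life example where we might want to broadcast; most NumPy operations work out the matching automatically.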

In [None]:
# An array with the number of hours we worked on
# three different projects throughout the workweek.
hours_on_each_project = np.array([[2,4,2],[1,7,0],[5,2,1],[8,0,0],[3,3,2]])
billing_rate_per_project = np.array([50,60,25]) # $/hour

# Money billed to each project for each day of the workweek.
money_on_each_project = hours_on_each_project*billing_rate_per_project
print("money billed to each project:\n", money_on_each_project)
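
Broadcasting only works when the shapes line up. The sketch below shows one shape that works, one that fails, and a common fix. The array `bonus_per_day` is a hypothetical per-day multiplier invented for this example.

```python
import numpy as np

hours = np.ones((5, 3))                     # 5 days x 3 projects
rate_per_project = np.array([50, 60, 25])   # shape (3,)
bonus_per_day = np.array([1, 1, 1, 1, 2])   # shape (5,), hypothetical

# (5, 3) * (3,): the last dimensions match, so each row is multiplied.
ok = hours * rate_per_project

# (5, 3) * (5,): the last dimensions are 3 and 5, so NumPy raises an error.
try:
    hours * bonus_per_day
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

# Adding a length-1 axis turns (5,) into (5, 1), which is compatible with (5, 3).
per_day = hours * bonus_per_day[:, np.newaxis]

print(ok.shape, mismatch_raised, per_day.shape)
```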



## 🔽 	▶️  Aggregating Across an Axis

We may want to perform an operation across one dimension of an array (row, column, or a third/fourth dimension). Here we can use the ```np.apply_along_axis()``` function, specifying the array, the function, and the axis to apply it along.
Here is an example. Notice how ```axis=0``` and ```axis=1``` perform the function on different dimensions of the array.

In [None]:
price_per_meal = np.array([[8,12,23],  # Bfast, lunch, dinner - day 1
                           [6,12,32],  # Bfast, lunch, dinner - day 2
                           [8,15,19],  # Bfast, lunch, dinner - day 3...
                           [3,9,40],
                           [4,18,22]])

average_meal_cost = np.apply_along_axis(arr=price_per_meal, func1d = np.mean, axis=0)
print('average_meal_cost:\n', average_meal_cost) # Avg meal cost for bfast, lunch, and dinner.


total_cost_each_day= np.apply_along_axis(arr=price_per_meal, func1d = sum, axis=1) # python comes with a native sum function.
print('\ntotal_cost_each_day:\n', total_cost_each_day) # Total meal cost per each day.
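
For common aggregations, NumPy arrays also have methods such as `.mean()` and `.sum()` that take an `axis` argument directly. The sketch below repeats the meal example with those methods and checks that the result matches `np.apply_along_axis()`.

```python
import numpy as np

price_per_meal = np.array([[8, 12, 23],
                           [6, 12, 32],
                           [8, 15, 19],
                           [3, 9, 40],
                           [4, 18, 22]])

# axis=0 collapses the rows: one average per meal type (column).
average_meal_cost = price_per_meal.mean(axis=0)

# axis=1 collapses the columns: one total per day (row).
total_cost_each_day = price_per_meal.sum(axis=1)

# Same numbers as the apply_along_axis version.
same_as_apply = np.allclose(
    average_meal_cost,
    np.apply_along_axis(np.mean, 0, price_per_meal),
)

print(average_meal_cost)
print(total_cost_each_day)
print(same_as_apply)
```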



<br>
<img src="https://edcarp.github.io/2018-11-06-edinburgh-igmm-python/fig/python-operations-across-axes.png" width=600>
<br>

This image comes from a different dataset, but it illustrates how the axes work.

Suppose we need the maximum inflammation for each patient over all days (the diagram on the left) or the average for each day (the diagram on the right). As the diagram above shows, each calculation is an operation across one axis.


Image [source](https://edcarp.github.io/2018-11-06-edinburgh-igmm-python/01-numpy/index.html)

-----

# Just for fun 🎧
Read about some of the top [trends from 2023 on Spotify](https://newsroom.spotify.com/2023-11-29/top-songs-artists-podcasts-albums-trends-2023/). Do you recognize any of these names from our dataset?


---
# Technical Notes and Credits 👏 🙏

The exercises for this notebook were adapted from resources available from the [Python for Supervised Machine Learning Bootcamp](https://nevadainbre.github.io/bootcamp-supervised-ml-2023-01.html) and Dr. Brianna Chrisman.
Exercises were also adapted from [Programming in Python Data Carpentry Resources](https://edcarp.github.io/2018-11-06-edinburgh-igmm-python/).

Thanks to the Data Science Initiative at UNR, supported by a Nevada INBRE supplemental award for Building Data Science Capacity, along with Research & Innovation for supporting this workshop during the first [Data Science Conference](https://www.unr.edu/bioinformatics/training-events/data-science-conference). Shout out and special thanks to Dr. Juli Petereit (Director of Nevada Bioinformatics Center) for her support with the Data Science Initiative. Another special thanks to Dr. Theresa McKim for creating this workshop and allowing extension of it.
