In [5]:
!gsutil cp gs://jax-presgraves-edusumner2-courses-code/IntrotoPythonII_2025/Module_4_material/*.csv /content/

'gsutil' is not recognized as an internal or external command,
operable program or batch file.


# Summary of previous notebooks: 
---
1. Data types (lists, strings, num, etc) and variables
2. For loops
3. Conditions: if elif else
4. While loops (in passing)
5. I/O

# Summary of this notebook:
---
1. **FUN**CTIONS
2. Modules and libraries
3. Using libraries: Numpy, Pandas, Seaborn, Matplotlib and more. 

## Data sets: 
(For this notebook)
1. titanic_pandas.csv
    * Original dataset available here: https://github.com/pandas-dev/pandas/blob/main/doc/data/titanic.csv

(Note: Data sets for the extra material notebook associated with Day4_A:) 

2. malariaMaize.csv

3. mammals.csv
    * find it here: https://zief0002.github.io/bespectacled-antelope/codebooks/mammals.html
    * the extra materials notebook will demo how to read it into your notebook


## Additional resources: 
1. [NumPy: A guide for absolute beginners](https://numpy.org/doc/stable/user/absolute_beginners.html)
2. [A list of the methods (and attributes) available with NumPy module](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.ndarray.html)
3. [NumPy Cheat Sheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf)
4. Pandas: just as there is pythontutor.com for Python, there is..... ta da! **https://pandastutor.com/**
5. [Ten minutes to pandas](https://pandas.pydata.org/docs/user_guide/10min.html) (really good!)
6. [Pandas Cheat Sheet](http://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
7. [Pandas documentation](http://pandas.pydata.org/pandas-docs/stable/)
8. [Matplotlib Example Library](https://matplotlib.org/stable/gallery/index.html)
9. [Visualization with Matplotlib](https://pandas.pydata.org/pandas-docs/version/0.15.0/visualization.html)

# Functions
---
**Procedural programming has an emphasis on bundling repeatable 'chunks' together in:**
1. For loops

2. Functions
   * When you bundle up functions, you can create Modules and, eventually, when they get too large, external libraries/packages


## Built-in Functions
We've already seen some built-in methods that are associated with a particular data type.
* e.g. the string methods `.upper()` and `.lower()` (`len()` and `str()`, which we have also used, are built-in functions rather than methods)
    
We've also seen built in functions that work more generally and might even work differently depending on the arguments that they are given.
*  `range()`
    * `range(6)` or `range(0,6)` produce the sequence 0, 1, 2, 3, 4, 5 (wrap it in `list()` to see it as a list)
    * `range(1,6,3)` produces 1, 4 since it covers the range from 1 up to (but not including) 6 in steps of size 3
* `type()` – returns type of data that is input as argument
* `float()` – converts integer to a float
* `round()`- rounds floats
* `max()` – takes any number of arguments and returns the largest one. It also works on strings (comparing them character by character), but it is easiest to reason about with floats and integers
* `min()` – opposite of `max()`
* `abs()` – absolute value but only takes one argument

If you are interested, you can explore the [list of built in functions](https://docs.python.org/3/library/functions.html).
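
As a quick sketch, here are the built-in functions listed above in action. The variable names are made up for illustration.

```python
# range() produces a lazy sequence; wrap it in list() to see the values
first_six = list(range(6))          # same values as range(0, 6)
stepped = list(range(1, 6, 3))      # start at 1, stop before 6, step of 3

# type() reports the data type; float() converts to a float
kind = type(3.5)
as_float = float(7)

# round() takes the number and how many decimal places to keep
rounded = round(3.14159, 2)

# max() and min() accept several arguments; abs() accepts exactly one
largest = max(4, 9.5, 2)
smallest = min(4, 9.5, 2)
distance = abs(-12)

print(first_six, stepped)
print(kind, as_float, rounded)
print(largest, smallest, distance)
```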

## Creating functions 
Creating our own functions offers a number of advantages: 
* Allows us to re-use a block of code many times
    * Reusable piece of code; allows for repeatability
* If we need to change the code, we only have to do it to the one function (instead of each time it is called in a program)
* Encapsulation (an important programming idea): Splitting code into functions, into little chunks that we can work on independently 
    * benefits of modularization
    * Can use the block of code in other programs

### Parts of a function: 
1. Header
    * `def function_name` – there are nomenclature rules
    * parameters – name of arguments located between `()`
    * `:`
    * Example:
        * `def hello_world():`
2. (optional) comment
    * explanation of function following `#` 
3. Body
    * The procedures of the function, indented four spaces

**Example:**

<div class="alert alert-block alert-warning">
    
    def function_name(argument or nothing at all):
        Blah
        Blah
        Blah
        return something


### Key Points about Functions
* Functions don’t necessarily have to return a value
    * If no return, they will return None
* Functions don’t need to take arguments
* Functions don’t need to take arguments in a particular order if you pass them by keyword (e.g. `name=value`)
* Function arguments can have defaults 
* Functions can use *args, an argument passed into a function that **allows a flexible number of items to be inputted when the function is called**

#### **Important Note:** 
There is a BIG difference between **defining** a function and **calling** it.
* After defining a function, you must call it in order for it to be executed.
* **Functions return to where they are called from.**
    * At the end of a function, `return` prompts Python to exit the function and hand any values on the return line back to the caller, where they can be assigned to a variable. If you do not include a return line, Python will automatically return `None` for you.
    
To summarise: **the calling function suspends execution at the point of the call, the body of the function is executed, and control is returned to just after the point where the function was called.**
    
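Functions have their own **scoping rules**: variables created within the body of a function are local to it, so using them outside of the function raises a NameError. (Python loops, by contrast, do not create a new scope: a variable assigned inside a loop is still available after the loop ends.)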
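
A minimal sketch of two of the points above: a function without a `return` line gives back `None`, and a variable created inside a function is not visible outside it. The function `greet` and its variable names are hypothetical.

```python
def greet(name):
    # 'message' is a local variable: it exists only while greet() runs
    message = "Hello, " + name
    print(message)
    # no return statement, so Python returns None automatically

result = greet("Ada")       # calling the function prints the greeting
print(result)               # the value handed back to the caller

# Trying to use the local variable outside the function fails
try:
    print(message)
except NameError as err:
    scope_error = str(err)
    print("NameError:", scope_error)
```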

### Calling a function
* What return value/type?
* What argument type/number?
* Sometimes you also need to know the order of the arguments
    1. Key word arguments:
        * Allows us to call functions with list of variables in whatever order we like
    2. Defaults:
        * Allows us to specify default values for arguments


### Examples of User-Defined Functions:

In [6]:
# here you are defining the function
# This function raises a given number to a given power
# NOTE: you should put all of your function definitions at the top of your program 
def power_to(base,exponent):
    res = base**exponent
    print(str(base)+ " to the power of " + str(exponent)+ " is: " +str(res))
    print("Here is the return statement to orient me")
    # The function returns something, but it doesn't HAVE TO. 
    return res

# here you are calling the function
# you are passing in two arguments to the function when you call it
# from the main part of the program
result_main=power_to(10,2)
result_again=power_to(100,2)

# Why does this raise an exception? 
#print(res)
#print(base)
# This error should help you understand scope. 

print(result_main)
print(result_again)

10 to the power of 2 is: 100
Here is the return statement to orient me
100 to the power of 2 is: 10000
Here is the return statement to orient me
100
10000


In [7]:
## Functions can return a boolean
def is_GandC_rich(dna):
    length = len(dna)
    g_count = dna.upper().count("G")
    c_count = dna.upper().count("C")
    gc_count=(g_count + c_count)/length
    if gc_count > 0.65:
        return True
    else:
        return False
# Another, more concise way to do this would be:
#   return gc_count > 0.65

print(is_GandC_rich("CGCGCGTACG"))
print(is_GandC_rich("ATATATATATA"))

True
False


### [Rosalind Problem Revisited](https://rosalind.info/problems/gc/)

Let's modify this problem slightly to find the AT content of a provided DNA string.

The total percentage of A and T in the sequence is important since it can be informative about mutation rates, [as described by Hershberg & Petrov](https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1001115).

Solve this using functions. We can dissect out the cell below to figure it out. 

In [8]:
# notice that this function has two parameters and 
# one of them (sig_dig) has a default value
# sig_dig can also be passed by keyword when the function is called
def get_at_content(dna,sig_dig=2):
    length=len(dna)
    #print(length)
    a_count=dna.upper().count("A")
    t_count=dna.upper().count("T")
    at_content=(a_count+t_count)/length
    #built in function round takes a number and the number of decimal places to keep
    return round(at_content,sig_dig)

# ----
# main part of the program
# ----
my_at_content=get_at_content("ATATATATACGGGGGGGGG")
# no sig dig have been specified so it gives you the default of 2 
print(str(my_at_content))
print("_______")

# we can specify both arguments
print(get_at_content("aactgtagcga",5))
print("_______")

# BY USING KEY WORDS, YOU DON'T EVEN HAVE TO PROVIDE THE ARGUMENTS IN THE SAME ORDER!
print(get_at_content(sig_dig=5,dna="ATGCGATAGTATCCCTAGGAT"))

0.47
_______
0.54545
_______
0.57143


### Other Argument Types

#### Functions can take lists as input
* Functions can take lists as arguments
    * You can pass a list to a function the same way you pass any argument
    * You will need to use one of the two formats: 
        function_name(listname) or function_name([list items])

##### Example:

In [9]:
# Define a function that counts the number of times "fizz" is in the list
def fizz_count(x):
    count=0
    for item in x:
        if item=="fizz":
            count=count+1
    return count

# we can create a list with a name
fizzy_list=["fizz","cat","fizz"]
# pass list name into the function
print(fizz_count(fizzy_list))

# or we can directly pass a list into the function
print(fizz_count(["fizz","cat","fizz"]))

2
2


#### Functions can take multiple lists as input:

In [10]:
# Functions can also take multiple lists as input 
def join_lists(x,y):
    return x+y

m = [1, 2, 3]
n = [4, 5, 6]

print(join_lists(m, n))

[1, 2, 3, 4, 5, 6]


#### Functions can take a FLEXIBLE number of arguments:

<div class="alert alert-block alert-warning">

    Note that the important part of *args isn't actually the name args, it is the *. See https://realpython.com/python-kwargs-and-args/

In [11]:
# *args means the function accepts any number of positional arguments. 
def my_sum(*args):
    result = 0
    # Iterating over the Python args tuple
    for x in args:
        result += x
    return result

print("How about with FOUR arguments provided: ")
print(my_sum(1,2,3,4))
print("What about with only two arguments provided: ")
print(my_sum(1,2))
#print(my_sum())

How about with FOUR arguments provided: 
10
What about with only two arguments provided: 
3


# Modules
---
Python (and most other languages) have built-in functions that are general use.

Modules and libraries (we’ll learn about them later) are a way to address discipline-centric functions.

We can import modules that contain functions and variables:
       
        import math
        print(math.sqrt(100))

or, equivalently, we can bring in just one function from a module like so:
       
        from math import sqrt
        print(sqrt(100))

or we can import all the functions from a module using an asterisk:

        from math import *

## What are modules?
* Collection of specialized functions, data types
    * More efficient
    * Allows for code re-use
    * Acts as documentation for other programmers (or you) who read your program later
* Modularization: tools are available when you need them; they don’t load automatically
* Library – a collection of related modules (the term is often used loosely for a single module, too)
* Come with documentation

## Namespace
* A namespace is the collection of names (functions, variables, etc.) that belong together; it determines what a given name refers to
* Modules have their own namespace (the names of the functions that belong to that particular module)
    * The documentation should give you a list of names in the namespace
    * To get list of names in namespace: 

            import my_module

            print(dir(my_module))
* Conventions: 
    * Uppercase names: constants
    * `_name` (leading underscore): for internal use only
    * `__name__` (double underscores on both sides): special meaning to Python
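
As a small sketch of these ideas, the standard-library `math` module can be inspected with `dir()`. The list names below (`all_names`, `public_names`, `special_names`) are made up for illustration.

```python
import math

# dir() lists every name in the module's namespace
all_names = dir(math)

# names without a leading underscore are the public ones meant for users
public_names = [name for name in all_names if not name.startswith("_")]

# names wrapped in double underscores have a special meaning to Python
special_names = [name for name in all_names
                 if name.startswith("__") and name.endswith("__")]

print(public_names[:5])
print(special_names)
print(math.__name__)    # the module's own name, stored in a special attribute
```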

## Importing Modules

In [13]:
# Our DS library stack
import numpy as np
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt

## Using Modules

In [None]:
# here is an example that uses a module named random. It is part of Python's standard library, so it comes with every Python installation. 
import random
# instead of importing the whole random module, you could import just ONE function from it
#from random import randint 

print("Lucky numbers! Three numbers will be generated")
print("If one of them is 5, you lose.")

count = 0
while count <3:
    num=random.randint(1,6)
    #num = randint(1, 6)  # use this line instead if you imported randint directly
    print(num)
    if num == 5:
        print("Sorry, you lose!")
        break # break exits the while loop and also skips the loop's else clause
    count+=1
else:
    print("you win!")

print("See you again.")

# NumPy, Pandas, and other important DS libraries/packages
---
```
"Good data scientists understand, in a deep way, that the heavy lifting of cleanup and preparation are not something that gets in the way of solving the problem: it IS the problem" - DJ Patil
```

Tidy data: 
1. each variable is in a column
2. each observation is a row
3. one table for each type of data

Clean Data: 
* Data cleaning is the process of detecting and removing (or correcting) data records that are incomplete, irrelevant, inaccurate, or duplicated.
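
A minimal sketch of two common cleaning steps in pandas, removing duplicated and incomplete records. The table `raw` and its column names are hypothetical toy data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: one row per observation, one column per variable
raw = pd.DataFrame({
    "sample": ["s1", "s2", "s2", "s3"],
    "weight_kg": [81.65, 97.52, 97.52, np.nan],
})

# drop exact duplicate rows, then drop rows with missing values
no_duplicates = raw.drop_duplicates()
clean = no_duplicates.dropna()

print(raw.shape, no_duplicates.shape, clean.shape)
print(clean)
```

Whether to drop or to repair a problem record depends on the data set; dropping is simply the shortest thing to show here.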

## NumPy
NumPy is a Python package. Its name stands for 'Numerical Python' and it is the foundation of the Data Science libraries: most of the other useful libraries in Data Science are built on, and optimized for, a particular data type that NumPy introduced, called an array. Data with two (or more) dimensions is really challenging to work with in plain Python, so NumPy's straightforward array object made a bunch of other challenges easier to solve. That's why it is everywhere!

**Two major benefits of NumPy that we'll focus on:**

1. efficiency of the numpy array object

2. slicing, slicing, slicing (okay, and subsetting)

(3. methods associated with numpy array objects)

**But, like, specifically: Why do we care about arrays?**
* Arrays can be 1D, like lists, but they can also be 2D, like matrices, and higher-dimensional still. This allows them to represent many different kinds of numerical data (and photos can be 2D grids...). 
* Arrays can be operated on along axes. We will often want to do things like calculate the sum down each column. We do this by specifying axis=0.
* Arrays allow the expression of many numerical operations at once. 
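
A short sketch of the points above, using a made-up 2D array called `grid`:

```python
import numpy as np

# a 2D array: 2 rows, 3 columns
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

column_sums = grid.sum(axis=0)   # collapse the rows: one sum per column
row_sums = grid.sum(axis=1)      # collapse the columns: one sum per row
doubled = grid * 2               # one expression acts on every element

print(grid.shape)
print(column_sums)
print(row_sums)
print(doubled)
```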

Basic Functions of NumPy include:
1. Fast vectorized array operations for data wrangling, subsetting, filtering, transformations without having to use if/elif/else branches. Basically, **applying conditions and criteria**. 
2. Common algorithms such as sorting, unique, max, min, abs
3. Efficient descriptive statistics and aggregation of data (though we will probably mostly use pandas for that), plus random sampling, e.g. filling an array with elements drawn from a normal distribution
4. Slicing
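
The items in this list can be sketched on a small made-up array (`values` and the other names are for illustration only):

```python
import numpy as np

values = np.array([3, -1, 4, -1, 5, -9, 2, 6])

# a condition on an array gives an array of True/False (a boolean mask)
is_positive = values > 0
positives = values[is_positive]          # keep only elements where mask is True

# np.where picks between two options element by element, no if/else needed
clipped = np.where(values < 0, 0, values)

# common algorithms
sorted_values = np.sort(values)
unique_values = np.unique(values)
absolute = np.abs(values)

# slicing: elements at positions 2, 3 and 4
middle = values[2:5]

print(positives, clipped)
print(sorted_values, unique_values, absolute)
print(middle)
```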

### Group problem: 
1. BMI is a (rightfully) suspect term for tracking health but it is convenient and, thus, still widely used as a summary statistic.
   
   We will create two lists - first in Python and then in NumPy - and compare the speed and ease of using plain old Python to Numpy when calculating BMI (which is weight/height^2) 

    **Data:** <br>
    `height = [1.87,  1.87, 1.82, 1.91, 1.90, 1.85]` <br>
    `weight = [81.65, 97.52, 95.25, 92.98, 86.18, 88.45]` <br>

In [None]:
## in python:
height=[1.87, 1.87, 1.82, 1.91, 1.90, 1.85]
weight = [81.65, 97.52, 95.25, 92.98, 86.18, 88.45]

# Fill in with BMI variable here

# list comprehension
# [function_call for list-args-in-function in zip(lists in function)]
bmi = [round(w/h**2,2) for w,h in zip(weight, height)]
# then print it out!
bmi

[23.35, 27.89, 28.76, 25.49, 23.87, 25.84]

In [None]:
import numpy as np

# We can create arrays by using the array function from NumPy:
height_np = np.array([1.87, 1.87, 1.82, 1.91, 1.90, 1.85])
weight_np = np.array([81.65, 97.52, 95.25, 92.98, 86.18, 88.45])

# Use functions from numpy to do the calculation in one go:
BMI_np = np.round(weight_np/(np.power(height_np,2)),2)
# then print it out!
print(BMI_np)
print(BMI_np[2])

[23.35 27.89 28.76 25.49 23.87 25.84]
28.76
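
With only six values, any speed difference is too small to notice. One way to compare the two approaches is to time them on a larger, made-up data set with the standard-library `timeit` module. This is only a sketch: the random data and the helper functions `bmi_python` and `bmi_numpy` are assumptions, and timings will vary between machines.

```python
import timeit
import numpy as np

# Made-up data: 100,000 heights and weights, stored both ways
rng = np.random.default_rng(0)
height_np = rng.uniform(1.5, 2.0, size=100_000)
weight_np = rng.uniform(50, 100, size=100_000)
height = height_np.tolist()
weight = weight_np.tolist()

def bmi_python():
    # plain Python: loop over the paired values one at a time
    return [w / h**2 for w, h in zip(weight, height)]

def bmi_numpy():
    # NumPy: one expression applied to the whole array
    return weight_np / height_np**2

# time 20 repetitions of each approach
python_seconds = timeit.timeit(bmi_python, number=20)
numpy_seconds = timeit.timeit(bmi_numpy, number=20)

print("plain Python:", python_seconds)
print("NumPy:       ", numpy_seconds)
```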


### Slicing with NumPy

In [None]:
# An example of slicing (since this helps demonstrate why we care about NumPy so much). 
# What happens when you have a THREE dimensional (or higher) array? 
array_3d = np.array([[[1,2,3],[4,5,6]],[[7,8,9],[10,11,12]]])
print(array_3d)
print(array_3d.shape)
print("~~~~~~~~~~")

#think about what you expect to print out here and why:
print(array_3d[1,1,2])
print("---------")

# what about here?:
print(array_3d[0,1,2])
print("~~~~~~~~~")

print(array_3d[0])
print("---------")

print(array_3d[0][1])
print("~~~~~~~~~")

print(array_3d[0][0][2])

A typical workflow: get the columns you need from a DataFrame, convert them to NumPy arrays, perform array-level operations, and assign the results back to a DataFrame column.
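
That workflow might look something like the sketch below, which reuses the height and weight values from the BMI problem. The DataFrame `people` and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame built from the height/weight data used earlier
people = pd.DataFrame({
    "height": [1.87, 1.87, 1.82, 1.91, 1.90, 1.85],
    "weight": [81.65, 97.52, 95.25, 92.98, 86.18, 88.45],
})

# 1. get the columns and convert them to NumPy arrays
height_arr = people["height"].to_numpy()
weight_arr = people["weight"].to_numpy()

# 2. perform the array-level operation
bmi_arr = np.round(weight_arr / height_arr**2, 2)

# 3. assign the result back to the DataFrame as a new column
people["bmi"] = bmi_arr

print(people)
```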

## Pandas

Pandas is an open source data analysis and manipulation module for Python.

It can be used to import files, such as .csv files, and work with their data in a manner similar to the one provided by R!

Pandas can be used to create dataframes, which are 2D tabular data structures with labeled axes (rows and columns). 

### Titanic Dataset

In [14]:
# here we import the Titanic data set
#---------------------
# Note: your filepath will possibly look different depending on where you placed the 
# Titanic_pandas.csv file or if you are using colab and uploading files. 
#---------------------

titanic_pandas = pd.read_csv("Titanic_pandas.csv")

# If you don't have the file saved locally, you can use the following line instead:
# titanic_pandas = pd.read_csv("https://raw.githubusercontent.com/awnorowski/BDSiC_2025/refs/heads/main/data/Titanic_pandas.csv")

# The .head() method allows us to peek at the top of the dataframe
titanic_pandas.head()

#We could also print the dataframe, but the formatting will not be as nice
#print(titanic_pandas)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1.0,1.0,"Allen, Miss. Elisabeth Walton",female,29.0,0.0,0.0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1.0,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1.0,0.0,"Allison, Miss. Helen Loraine",female,2.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1.0,0.0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1.0,2.0,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1.0,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


### DataFrame Methods
Pandas dataframes have a number of built-in methods that allow us to quickly examine the data. For a full list of methods, see the [pandas.DataFrame documentation](https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.DataFrame.html).

In [15]:
# Columns are an *attribute* of dataframes
print(titanic_pandas.columns)
print("\n~~~~~~~~~~~~~~~~~~~~\n")

# Some important pandas DataFrame methods.
# What do these do?
print(titanic_pandas.sample(3))
print("\n-----------\n")

print(titanic_pandas.describe())
print("\n-----------\n")

print(titanic_pandas.head(2))
print("\n-----------\n")

print(titanic_pandas.shape)

Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
       'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'],
      dtype='object')

~~~~~~~~~~~~~~~~~~~~

      pclass  survived                              name     sex   age  sibsp  \
984      3.0       1.0  Madigan, Miss. Margaret "Maggie"  female   NaN    0.0   
1183     3.0       0.0         Salonen, Mr. Johan Werner    male  39.0    0.0   
594      2.0       0.0             Wheadon, Mr. Edward H    male  66.0    0.0   

      parch      ticket    fare cabin embarked boat  body  \
984     0.0      370370   7.750   NaN        Q   15   NaN   
1183    0.0     3101296   7.925   NaN        S  NaN   NaN   
594     0.0  C.A. 24579  10.500   NaN        S  NaN   NaN   

                             home.dest  
984                                NaN  
1183                               NaN  
594   Guernsey, England / Edgewood, RI  

-----------

            pclass     survived          age        sibsp     

### Working with a DataFrame in Pandas
Let's explore some different ways of working with data using pandas.

Two things we will investigate:
1. How to sort the data set by sex
2. How about survival? Survival is coded as: 0 (died), 1 (survived)

Let's start with the `.groupby` method.

In [None]:
# groupby is AMAZING. It emulates a lot of the functionality of SQL!
# .groupby(names of the column to group on)[column(s) on which to perform the method].method()
# remember the python tutorial example about dog breeds - they used groupby in that query!
print(titanic_pandas.groupby('pclass')['age'].mean())
#print(titanic_pandas.groupby('survived')['age'].mean())

print("~~ NOT THE SAME THING: double brackets select a dataframe, single brackets select a series ~~~")
print(titanic_pandas[['pclass','age']].mean())
print("~~~~~"*10)
print(titanic_pandas["pclass"].value_counts())

# How to sort the data by sex:
# fill in below with examples during class!
#
#

#mean age of the classes and sex: we can select for more than one column at a time: 
#print(titanic_pandas.groupby(FILL IN ))
#what about .count method?
#print(titanic_pandas.groupby(FILL IN ))
# how many survived -let's use .sum()
# FILL IN 
print("$$$$$$$$$$")
#print(titanic_pandas.groupby(FILL IN ))
print("------------aggregate function --------------")
#aggregate function!
print(titanic_pandas.groupby('age').aggregate('sum'))
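
The `.groupby` pattern can also be tried out on a tiny table where the answers are easy to check by hand. This is a sketch on hypothetical toy data (`pets`), not the Titanic file.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic file
pets = pd.DataFrame({
    "species": ["dog", "cat", "dog", "cat", "dog"],
    "sex": ["F", "F", "M", "M", "F"],
    "age": [3, 5, 7, 1, 5],
})

# single brackets after groupby select one column, so the result is a Series
mean_age = pets.groupby("species")["age"].mean()

# grouping on a list of columns gives one result per combination
mean_age_by_both = pets.groupby(["species", "sex"])["age"].mean()

# .size() counts the rows in each group
group_sizes = pets.groupby("species").size()

print(mean_age)
print(mean_age_by_both)
print(group_sizes)
```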

#### Methods: `.unique`, `.dropna`, and `.value_counts`

In [None]:
# TITANIC data set - introducing some methods
# We can see the unique method in use for the titanic data set
print(titanic_pandas["embarked"].unique())
# We can get rid of the missing values (nan)
print(titanic_pandas["embarked"].dropna().unique())
print(titanic_pandas["embarked"].value_counts())
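
To see exactly what each of these three methods does with missing values, here is a sketch on a short, hypothetical Series (`ports`) that contains one `nan`:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value
ports = pd.Series(["S", "C", "S", np.nan, "Q", "S"])

with_missing = ports.unique()              # the missing value counts as a distinct entry
without_missing = ports.dropna().unique()  # missing value removed first
counts = ports.value_counts()              # skips missing values by default

print(with_missing)
print(without_missing)
print(counts)
```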

#### Additional Examples

Survival of passengers $\leq$ 5 years old:
1. How many children aged 5 or under survived?
2. How many children aged 5 or under were on the Titanic in the first place?
3. Print out the class, sex and age of any children aged 5 or under who survived.


In [None]:
print("How many children, 5 and under, were on the titanic?")
# FILL IN WITH LOGIC
print("How many children, 5 and under, survived the sinking of the Titanic?")
# FILL IN WITH LOGIC COMMAND
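
Questions like these are usually answered by filtering rows with a condition. The sketch below shows the general pattern on hypothetical toy data (`visitors`), so it is not a solution to the Titanic questions themselves.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic file
visitors = pd.DataFrame({
    "age": [4, 35, 2, 61, 5],
    "member": [1, 0, 0, 1, 1],
    "city": ["Bar Harbor", "Boston", "Bangor", "Boston", "Portland"],
})

# a comparison on a column gives a True/False Series (a boolean mask)
is_young = visitors["age"] <= 5
young = visitors[is_young]
n_young = len(young)

# combine conditions with & (and) or | (or); each needs its own parentheses
young_members = visitors[(visitors["age"] <= 5) & (visitors["member"] == 1)]

print(n_young)
print(young_members[["age", "city"]])
```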

### Plotting data
Here are some examples of plotting data using [DataFrame.plot](https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.DataFrame.plot.html) and matplotlib.pyplot.

In [None]:
# a few basic plots with Titanic: 
titanic_scatter=titanic_pandas.plot.scatter('age', 'fare', c = "blue", s = 5, xlabel="Age (years)", ylabel="Fare (pounds, 1912 prices)")

In [None]:
titanic_hist = titanic_pandas.fare.plot.hist(bins = 40, color = 'green',xlabel='Fare(pounds, 1912 prices)')

In [None]:
# bar plot
contingency_titanic = titanic_pandas.groupby(['pclass', 'survived']).size().unstack()
titanic_barplot = contingency_titanic.plot.bar(stacked=True, xlabel="Passenger class",ylabel="Counts",color = ["lightblue", "darkblue"])
#titanic_barplot = contingency_titanic.plot.bar(color = ["lightblue", "darkblue"])
#plt.ylabel("Counts")
#plt.xlabel('Passenger class')
plt.xticks(rotation=0)
#plt.show()

In [None]:
contingency_titanic = titanic_pandas.groupby(['pclass',"sex","survived"]).size().unstack()
titanic_barplot = contingency_titanic.plot.bar(stacked=True, 
                                               color = ["lightblue", "darkblue"])
plt.ylabel("Counts")
plt.xlabel('Passenger class')
plt.xticks(rotation=0)
#---------------------
plt.savefig('Titanic_tester.png')
plt.show()