# Assignment 3: Advanced Python (30 pt)

This assignment covers material from the lectures on loops, functions, and NumPy.

Note that these questions are longer and somewhat more open-ended than those in previous assignments. Please reach out if you need assistance getting started.

Feel free to create as many Python or Markdown cells as you desire to answer the questions.

## Question 1: For loops (10 pt)

Below, we have a nested dictionary structure containing information about several species ranging from vulnerable to critically endangered. Note that in some cases, species populations are listed as `None`. This means that the wild populations of these species are unknown.

Use for loops to accomplish the following tasks: 

- Create a data structure containing all unique types of "Threats". This variable should not contain duplicate entries. Print the structure (2 pt).
- Create a list of all of the species listed as "Critically Endangered". Print the list (2 pt).
- Create a separate list containing the names of species with fewer than 50 individuals and species with unknown population sizes. Print the list (3 pt).
- Find the species with the largest population size. Print this species' name and its population size (3 pt).

If you hard code the solutions (e.g. manually pick out which species has the largest population) you will receive NO points.
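As a hedged illustration of the looping pattern these tasks rely on, the sketch below walks a small nested dictionary with two `for` loops. The dictionary `toy_data` and its values are invented for illustration and are not part of the assignment data; the names `seen_threats` and `unknown_population` are likewise hypothetical.

```python
# Hypothetical toy data, invented for illustration only.
toy_data = {
    "Species A": {"Status": "Endangered", "Population": 120, "Threats": ["Habitat loss"]},
    "Species B": {"Status": "Vulnerable", "Population": None, "Threats": ["Poaching", "Habitat loss"]},
}

seen_threats = set()      # a set silently ignores duplicate entries
unknown_population = []   # names of species whose population is None

for name, info in toy_data.items():   # outer loop: one species per pass
    for threat in info["Threats"]:    # inner loop: that species' threats
        seen_threats.add(threat)
    if info["Population"] is None:    # test for None before comparing numbers
        unknown_population.append(name)

print(sorted(seen_threats))
print(unknown_population)
```

Checking `is None` before any numeric comparison matters because comparing `None` with an integer raises a `TypeError` in Python 3.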

In [3]:
conservation_data = {
    "Giant Panda": {
        "Status": "Endangered",
        "Population": 1800,
        "Threats": ["Habitat loss", "Poaching"]
    },
    "Mountain Gorilla": {
        "Status": "Critically Endangered",
        "Population": 1063,
        "Threats": ["Habitat loss", "Poaching", "Civil unrest"]
    },
    "Amur Leopard": {
        "Status": "Critically Endangered",
        "Population": 84,
        "Threats": ["Habitat loss", "Poaching"]
    },
    "Vaquita": {
        "Status": "Critically Endangered",
        "Population": 10,
        "Threats": ["Bycatch in fishing nets"]
    },
    "African Elephant": {
        "Status": "Vulnerable",
        "Population": 415000,
        "Threats": ["Habitat loss", "Poaching"]
    },
    "Javan Rhino": {
        "Status": "Critically Endangered",
        "Population": 72,
        "Threats": ["Habitat loss", "Poaching"]
    },
    "Sumatran Orangutan": {
        "Status": "Critically Endangered",
        "Population": 14600,
        "Threats": ["Habitat loss", "Poaching"]
    },
    "Hawksbill Turtle": {
        "Status": "Critically Endangered",
        "Population": None,
        "Threats": ["Habitat loss", "Poaching"]
    },
    "Saola": {
        "Status": "Critically Endangered",
        "Population": None,
        "Threats": ["Habitat loss", "Poaching"]
    },
    "Iberian Lynx": {
        "Status": "Endangered",
        "Population": 94,
        "Threats": ["Habitat loss", "Poaching"]
    }
}


In [4]:
"""""
Collect all unique threats across the species using nested for loops.

"""""

unique_threats = set() #use 'set' as the data structure

for k, v in conservation_data.items(): #Loop over dictionary
    for threat in v['Threats']: #Loop thru each animal's threats
        unique_threats.add(threat) #if threat not already in set, add it.

print(unique_threats) #print the set of unique threats



{'Civil unrest', 'Habitat loss', 'Poaching', 'Bycatch in fishing nets'}


In [5]:
"""""
Create a list of all of the species listed as "Critically Endangered". Print the list. 
"""""

critical = set()
for k, v in conservation_data.items(): #Loop over dictionary
    if v['Status']== "Critically Endangered": #check if animal's status is 'Critically Endangered'.
        critical.add(k) #if it is, add to the set. 
print(list(critical)) #print the set as a list.


['Saola', 'Javan Rhino', 'Sumatran Orangutan', 'Vaquita', 'Mountain Gorilla', 'Hawksbill Turtle', 'Amur Leopard']


In [6]:
"""""
Create a separate list containing the names of species with populations with fewer than 50 individuals and species with unknown population sizes. Print the list. 
"""""

small_population = set()
for k, v in conservation_data.items(): #Loop over dictionary
    if (v['Population']is None) or (v['Population']<50): #either population is not known or less than 50
        small_population.add(k) #if so, add animal to the set
print(list(small_population)) #print the set as a list. 


['Vaquita', 'Saola', 'Hawksbill Turtle']


In [9]:
"""""
Find the species with the largest population size. Print this species name, and what its population size is 
"""""

max_population = 0 #dont know largest population, yet
species = None # dont know that species yet, either
for k, v in conservation_data.items(): #Loop over dictionary
    if (v['Population'] is not None): #have to know the population size
        if v['Population'] > max_population: #if it is the largest (so far)
            species = k #remember this species
            max_population = v['Population'] #reset max. so far to be this population size 
    
#report species with the largest population in the entire data set
print(f"The {species} has the largest population with {max_population} animals.")

The African Elephant has the largest population with 415000 animals.


## Question 2: Functions (10 pt)

When considering the health of an ecosystem, an important concept to quantify is the diversity of that system. There are several metrics commonly used to calculate ecosystem diversity, one of which is called Simpson's Diversity Index.

This metric takes into account not only how many species are present in a location, but also whether one species has far more individuals than the others. For example, an ecosystem with 500 species but only one species above 10 individuals is not that diverse.

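We can calculate Simpson's Diversity ($D$) as follows, where $n_i$ is the number of individuals of species $i$ and $N$ is the total number of individuals across all species: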

$D = 1 - [(\frac{n_1}{N})^2 + (\frac{n_2}{N})^2 + (\frac{n_3}{N})^2 + ...]$

For example, if an ecosystem has four species with 5, 2, 2, and 1 individuals (10 individuals total), you can calculate $D$ like this:

$D = 1 - [(\frac{5}{10})^2 + (\frac{2}{10})^2 + (\frac{2}{10})^2 + (\frac{1}{10})^2] = 0.66$
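The arithmetic in this worked example can be checked with a few lines of Python. This is only a sketch of the formula applied to the four counts above, not a full solution; the variable names `counts`, `N`, and `D` are chosen here for illustration.

```python
counts = [5, 2, 2, 1]   # individuals per species, from the worked example
N = sum(counts)         # total number of individuals (10)

# Sum the squared proportions (n_i / N)**2, then subtract the total from 1.
D = 1 - sum((n / N) ** 2 for n in counts)

print(round(D, 2))
```

Rounding is used for display because floating-point arithmetic may leave the result a hair away from 0.66.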

Define a function that calculates and returns $D$ given a list of species population levels, and run the function on several example lists (3 pt).

Your answer should work for a list of **any** length (1 pt).

Add documentation to the function that describes what it does, the desired parameters, and what data types the parameters should be (2 pt).

Within the function, check that the input is a list. If the input is not a list, give a custom error message (2 pt).

Also, make sure all entries in the list are integers. If there are floats, convert them to integers. If there are entries that are not floats or integers, give a custom error message (2 pt).
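One common way to deliver a custom error message in Python is to `raise` an exception rather than print. The sketch below shows that pattern in isolation, assuming that raising `TypeError` is acceptable for this assignment; `clean_counts` is a hypothetical helper name, not something the assignment defines.

```python
def clean_counts(values):
    """Hypothetical helper: return the entries of `values` as ints, or raise."""
    if not isinstance(values, list):
        raise TypeError(f"Expected a list, got {type(values).__name__}.")
    cleaned = []
    for entry in values:
        # bool is a subclass of int, so it is rejected explicitly here
        if isinstance(entry, bool) or not isinstance(entry, (int, float)):
            raise TypeError(f"Entry {entry!r} is not an int or a float.")
        cleaned.append(int(entry))   # int() truncates floats toward zero
    return cleaned


print(clean_counts([3, 2.9, 7]))

try:
    clean_counts((3, 2))   # a tuple, not a list
except TypeError as err:
    print(f"Rejected: {err}")
```

Note that `int()` truncates rather than rounds, so `2.9` becomes `2`; whether truncation or rounding is the better conversion for population counts is a judgment call.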



In [15]:
# example_input = [1882, 400, 321, 24]
def diversity_index(populations: list[int]) -> float:
    """
    Calculate and return Simpson's Diversity Index (D) for a list of species population levels.

    Parameters:
        populations (list): one population count per species. Entries should be
            ints; floats are converted to ints.

    Returns:
        float: the index D, or None (after printing a message) if the input is invalid.
    """
    #check that the input is a list; if it is not, give a custom error message
    if not isinstance(populations, list):
        print("You must supply a list of integers.")
        return None
    N = 0 #total number of individuals
    sum_of_squares = 0 #running sum of the squared counts (avoids shadowing the built-in sum)
    for entry in populations: #loop over a population list of any length
        #entries must be ints or floats; anything else gets a custom error message
        if type(entry) not in [float, int]:
            print(f"Your list includes a non-number: {entry}.")
            return None #quit the function because the list is invalid
        e = int(entry) #guarantees entry is an integer
        sum_of_squares += e**2 #add one numerator term
        N += e #add entry to N
    if N == 0: #an empty or all-zero list would cause a division by zero
        print("The populations must sum to a positive number.")
        return None
    D = 1 - (sum_of_squares / N**2) #the Diversity Index

    return D


In [16]:
populations = [1882, 400, 321, 24]
diversity_index(populations)


0.4485625467948795

In [17]:
populations = {1882, 400, 321, 24}
diversity_index(populations)


You must supply a list of integers.


In [18]:
# try the function several times with non-integers and one invalid list.

examples = {1: [64982.7, 82348, 827, 34938.2], 2: [723649, 762, 72.4, 627], 3: [82346, 3282, 82389.5, 8289],
            4: [555, 51636.5, 'q', 78.887]}
for k, v in examples.items(): #loop across the example lists
    print(f"Simpson's Diversity Index for example {k}...{v}:", end=' ')
    D = diversity_index(v) #call the function once and reuse the result
    if D is not None: #if the function can compute D
        print(f"{D: .3f}") #print D for that list

Simpson's Diversity Index for example 1...[64982.7, 82348, 827, 34938.2]:  0.635
Simpson's Diversity Index for example 2...[723649, 762, 72.4, 627]:  0.004
Simpson's Diversity Index for example 3...[82346, 3282, 82389.5, 8289]:  0.561
Simpson's Diversity Index for example 4...[555, 51636.5, 'q', 78.887]: Your list includes a non-number: q.


## Question 3: Simulating data (10 pt)

In data analysis, we often simulate data to help test our predictions and get a feel for what the real data should look like. This question asks you to use the functions found in `numpy.random` to simulate rolling a die.

Define a function called `dice_simulator()` with an integer parameter called `n`. This function should create a list of integers 1 through 6 and randomly sample this list with replacement `n` times. The function should return the `n` samples as a list or numpy array. Note that `n` should be a positive integer (2 pt).
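For reference, sampling with replacement in NumPy can be sketched as follows. This uses the `Generator` interface returned by `np.random.default_rng()`, which NumPy's documentation recommends for new code; the older `np.random.choice()` function takes similar arguments. The seed value and the sample size of 10 are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)   # seed chosen arbitrarily so reruns match
sides = [1, 2, 3, 4, 5, 6]

# Draw 10 values from `sides`; replace=True allows a face to repeat.
sample = rng.choice(sides, size=10, replace=True)

print(sample)
print(sample.shape)
```

The result is a NumPy array; calling `sample.tolist()` converts it to a plain Python list of ints if a list is preferred.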

Define a function called `proportions()` to calculate what proportion of the "rolls" are 1s, 2s, 3s, 4s, 5s, and 6s. Print these 6 proportions. `proportions()` should have a single parameter called `rolls`, which should take in the output of `dice_simulator()` (3 pt).
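The counting step can be sketched with `collections.Counter` from the standard library. This is one possible approach rather than the required one; `roll_proportions` and `example_rolls` are hypothetical names, and the rolls are made up.

```python
from collections import Counter


def roll_proportions(rolls):
    """Hypothetical helper: map each face 1-6 to its share of `rolls`."""
    counts = Counter(int(r) for r in rolls)   # int() also accepts NumPy integers
    total = len(rolls)
    # A Counter returns 0 for faces that never appeared.
    return {face: counts[face] / total for face in range(1, 7)}


example_rolls = [1, 3, 3, 6]   # made-up rolls for illustration
for face, share in roll_proportions(example_rolls).items():
    print(f"{face}: {share:.2f}")
```

An empty `rolls` would cause a division by zero here, so a caller is assumed to pass at least one roll.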

Define a function called `three_streak()` to calculate the maximum number of times 3 was "rolled" in a row and print this value. To be in a row, the 3's have to be next to each other in a list (such as if `rolls[1]` and `rolls[2]` are both 3). Like `proportions()`, `three_streak()` should have a single parameter called `rolls`, which should take in the output of `dice_simulator()` (3 pt). 
- *Hint: `max()` is a built in function in Python that finds the largest value in a list.*
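To make the idea of a "streak" concrete, here is a minimal sketch that measures runs of a value with `itertools.groupby` and then applies `max()`. It is an alternative to an explicit counter loop, shown only to illustrate the concept; `longest_run` is a hypothetical helper name.

```python
from itertools import groupby


def longest_run(values, target):
    """Hypothetical helper: length of the longest unbroken run of `target`."""
    # groupby() bundles consecutive equal items, so each group is one run.
    run_lengths = [len(list(group)) for value, group in groupby(values) if value == target]
    return max(run_lengths, default=0)   # default=0 covers "target never appears"


print(longest_run([3, 3, 1, 3, 3, 3], 3))
print(longest_run([1, 2, 4], 3))
```

The first call includes a run that reaches the end of the list, a case that is easy to miss when counting with a manual loop.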

Define a function called `simulation()` that calls `dice_simulator()`, `proportions()`, and `three_streak()`. Make sure that `proportions()` and `three_streak()` are called so that they use the same dice rolls. `simulation()` should take a single parameter `n` that is fed into `dice_simulator()`. Have this function print the value of n, as well (1 pt). 

Call `simulation()` several times with the `n` parameter at different values (1 pt). 





In [6]:
import numpy as np

"""""
Define a function called `dice_simulator()` with an integer parameter called `n`. This function should create a list of integers 1 through 6 and randomly sample 
this list with replacement `n` times. The function should return the `n` samples as a list or numpy array. Note that `n` should be a positive integer
"""""

def dice_simulator(n: int) -> list[int]:

    #check the type first so that a non-numeric n never reaches the comparison
    if (type(n) != int) or n <= 0:
        print(f"n needs to be a positive integer, but got {n!r}.")
        return []

    sides = [1,2,3,4,5,6] #create a list of integers 1 through 6
    return list(np.random.choice(sides,n,replace=True)) #randomly sample with replacement and return the 'n' samples as a list

"""""

Define a function called `proportions()` to calculate what proportion of the "rolls" that are 1s, 2s, 3s, 4s, 5s, and 6s. 
Print these 6 proportions. `proportions()` should have a single parameter called `rolls`, which should take in the output of `dice_simulator()`.

"""""

def proportions(rolls: list[int]):

    rolls_of_k = {1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0} # a dictionary that counts each possible roll

    total_rolls = len(rolls)
    for k in rolls_of_k.keys(): #for each *possible* roll
        for j in range(len(rolls)): #for each *simulated* roll
            if rolls[j] == k:
                rolls_of_k[k] += 1 #adds 1 to roll count
    for k in rolls_of_k.keys(): #for each *possible* roll
        proportion_of_k = rolls_of_k[k] / total_rolls #calculate proportion of rolls
        print(f"{k} was rolled {100*proportion_of_k:.1f}% of the time.") #print each proportion


In [8]:
rolls = dice_simulator(100)
proportions(rolls)


1 was rolled 16.0% of the time.
2 was rolled 19.0% of the time.
3 was rolled 20.0% of the time.
4 was rolled 14.0% of the time.
5 was rolled 15.0% of the time.
6 was rolled 16.0% of the time.


In [19]:
"""
Define a function called `three_streak()` to calculate the maximum number of times 3 was "rolled" in a row and print this value. To be in a row, 
the 3's have to be next to each other in a list (such as if `rolls[1]` and `rolls[2]` are both 3). 
Like `proportions()`, `three_streak()` should have a single parameter called `rolls`, which should take in the output of `dice_simulator()`
"""
def three_streak(rolls: list[int]):

    all_three_streaks = [] #a list to keep track of streak lengths
    streak = 0 #at start, there is no streak
    for roll in rolls: #loop over simulated rolls
        if roll == 3:
            streak += 1 #start or extend the current streak
        else: #roll is not a '3'
            if streak > 0:
                all_three_streaks.append(streak) #add streak length to list
            streak = 0 #restart counter; the streak is over
    if streak > 0: #a streak that runs to the end of the list still counts
        all_three_streaks.append(streak)
    if len(all_three_streaks) > 0:
        print(f"The longest streak of 3's is: {max(all_three_streaks)} rolls.")
    else:
        print("A three was never rolled!")


In [20]:
rolls = [5,5,5,6,5,3,3,3,3,3,3,2,3,5,6,7,5,4,3,2,1,4,5]
three_streak(rolls)

The longest streak of 3's is: 6 rolls.


In [27]:
"""
Define a function called `simulation()` that calls `dice_simulator()`, `proportions()`, and `three_streak()`. Make sure that `proportions()` and `three_streak()` are called so that they use the same dice rolls. 
`simulation()` should take a single parameter `n` that is fed into `dice_simulator()`. Have this function print the value of n, as well.
"""

def simulation(n: int):

    print(f"This is a simulation of {n} dice rolls.")
    rolls = dice_simulator(n) #simulate n dice rolls
    if len(rolls) > 0: #in case 'n' was a bad input
        proportions(rolls) #compute their proportions
        three_streak(rolls) #reports streaks of 3's



In [28]:
"""
Call `simulation()` several times with the `n` parameter at different values

"""

simulation(75)


This is a simulation of 75 dice rolls.
1 was rolled 14.7% of the time.
2 was rolled 22.7% of the time.
3 was rolled 9.3% of the time.
4 was rolled 10.7% of the time.
5 was rolled 21.3% of the time.
6 was rolled 21.3% of the time.
The longest streak of 3's is: 2 rolls.


In [29]:
simulation(343)

This is a simulation of 343 dice rolls.
1 was rolled 13.7% of the time.
2 was rolled 13.4% of the time.
3 was rolled 19.0% of the time.
4 was rolled 19.0% of the time.
5 was rolled 17.2% of the time.
6 was rolled 17.8% of the time.
The longest streak of 3's is: 3 rolls.


In [31]:
simulation(10000)

This is a simulation of 10000 dice rolls.
1 was rolled 16.8% of the time.
2 was rolled 17.5% of the time.
3 was rolled 16.2% of the time.
4 was rolled 16.4% of the time.
5 was rolled 16.6% of the time.
6 was rolled 16.4% of the time.
The longest streak of 3's is: 5 rolls.


In [32]:
simulation(A) #fails before simulation() runs: A is an undefined name, so Python raises NameError

NameError: name 'A' is not defined