# Project Activity: More Penguins!

This Discussion activity is a component of your [group mini-project](https://philchodrow.github.io/PIC16A/project/). The usual Discussion expectations apply with regard to your participation grade (i.e. if you work for the full 50 minutes, you will get full credit). For the purposes of your final project, however, we recommend that you coordinate with your group to eventually complete all parts of this assignment.

## Group Roles

The roles for this Discussion activity are slightly modified. The Driver and Proposer are the same as usual. Instead of a Reviewer, use an **Interpreter**. The job of the Interpreter is to think about the significance of each of the code outputs in the context of the long-term project goal (classifying penguin species). In parts of the Discussion where the problems ask you to explain or interpret your findings, the Interpreter should suggest responses to the Proposer and Driver. The **Interpreter** may also give code feedback when the group is writing functions. 

### List Names and Group Roles Here: 

- Partner 1 (Role)
- Partner 2 (Role)
- Partner 3 (Role)


## Part A

Run the following cell to load the penguin dataset as a `pandas` `DataFrame` called `penguins`. I've also supplied code to shorten the penguin species names for convenient exploration and plotting.

*If you experience `ConnectionRefused` errors when doing this, instead copy/paste the url into your browser. Save the data in the same directory as this notebook in a file called `penguins.csv`, and then replace `url` with `"penguins.csv"` in the block below.* 

While working with this dataset, you may notice some blank or nonsensical values.  Normally, for a project such as this, we would want to remove these values before we continue.  However, in this worksheet you can just ignore them.
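
For reference, here is a minimal sketch of how missing values might be inspected and removed in a project setting. It uses a small made-up table called `toy` (a hypothetical stand-in, not the real penguin data), so nothing here changes `penguins`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the penguins table (made-up values, not real data)
toy = pd.DataFrame({
    "Species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "Body Mass (g)": [3750.0, np.nan, 3800.0, 3600.0],
    "Sex": ["MALE", "FEMALE", np.nan, "FEMALE"],
})

# Count the missing entries in each column
missing_counts = toy.isna().sum()
print(missing_counts)

# Keep only the rows with no missing entries; `toy` itself is unchanged
toy_clean = toy.dropna()
print(toy_clean)
```

Note that `dropna()` discards a whole row even if only one column is missing, which may be more aggressive than you want when you only care about a few columns.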

In [None]:
import pandas as pd
import numpy as np
import urllib
from matplotlib import pyplot as plt

url = "https://philchodrow.github.io/PIC16A/datasets/palmer_penguins.csv"
penguins = pd.read_csv(url)

# shorten the species name
penguins["Species"] = penguins["Species"].str.split().str.get(0)

In [None]:
# optional code here if you need to refresh your memory of the data set


## Part B

Write a function called `penguin_summary_table` which accepts two arguments, `group_cols` and `value_cols`. This function should create a table in which the mean of each element of `value_cols` is shown, grouped according to the specified `group_cols`. For example, the call 

```python
penguin_summary_table(["Species"], ["Culmen Length (mm)", "Culmen Depth (mm)"])
```

should produce a summary table with the mean culmen length and depth per species. 

For a more pleasant display, **round the numbers in your table to 2 decimal places**. This can be done using the code `my_data_frame.round(2)`. 
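
As a quick illustration of the rounding step only, the sketch below applies `round(2)` to a small made-up `DataFrame` (the name `toy` and its values are hypothetical). It shows that `round` returns a new table and leaves the original untouched.

```python
import pandas as pd

# Made-up numbers, used only to show what round(2) does
toy = pd.DataFrame({"length": [39.12345, 46.5], "depth": [18.70499, 17.0]})

# round(2) returns a new DataFrame with every numeric entry rounded
rounded = toy.round(2)
print(rounded)
```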

This function can be implemented in just a few lines. Comments and docstrings are not necessary. 

In [None]:
# your solution here


## Part C

Use your function to explore the data a bit. Focus on the physiological variables:

- `Culmen Length (mm)`
- `Culmen Depth (mm)`
- `Flipper Length (mm)`
- `Body Mass (g)`
- `Delta 15 N (o/oo)`
- `Delta 13 C (o/oo)`

These last two variables are measures of nitrogen and carbon isotopes in the penguins' bloodstreams. 

**Create at least three readable summary tables.** Then, work with your **Interpreter** to explain the significance of each table. Do you observe any important differences between the penguin species?

Make sure that each table has a message, and that no information is shown that is not part of that message. Is there a part of the table that you have nothing to say about? Remove it! 


- **Hint**: "This table suggests that there's not much of a difference between..." is a fine explanation of the table, as long as it's warranted. 
- **Hint**: consider the sex of the penguins as well as the species. 
    - There is a single penguin whose sex was not recorded by the researchers and is instead encoded as `.`. This should not cause major problems, but feel free to remove this row if you'd like to. 
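
If you do decide to remove that row, one possible approach is boolean filtering. The sketch below uses a made-up table called `toy` (hypothetical values); the same pattern should carry over to `penguins`, assuming the column is named `Sex`.

```python
import pandas as pd

# Made-up rows, one of which has the placeholder "." for sex
toy = pd.DataFrame({
    "Species": ["Adelie", "Gentoo", "Gentoo"],
    "Sex": ["MALE", ".", "FEMALE"],
})

# Keep only the rows whose Sex entry is not the "." placeholder
toy_filtered = toy[toy["Sex"] != "."]
print(toy_filtered)
```

This comparison keeps rows where `Sex` is missing entirely, since a missing value is not equal to `"."`.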

In [None]:
# Table 1


#### Discussion of Table 1

In [None]:
# Table 2


#### Discussion of Table 2

In [None]:
# Table 3


#### Discussion of Table 3

## Part D

Based on your findings from these tables, propose a miniature decision tree to help distinguish between the penguin species. Your decision tree might have rules like the following: 

1. First, check the island on which the penguin was found. 
    1. If Torgersen, then check the body mass. 
        1. If the body mass is over 4,000g, then guess Adelie. 
        1. Otherwise, guess Chinstrap.
    1. If Biscoe, then check the sex of the penguin. 
        1. If female, guess Gentoo.
        1. Otherwise, guess Chinstrap.
    1. If Dream, then guess Adelie.     
      
Your decision tree should operate using no more than three columns from the data. 

Below your decision tree, write an explanation of how you came up with it and how the tables that you created above informed your choices. 

If you like, you may skip ahead to Part E and write your decision tree directly as a Python function. You should then explain your reasoning as a docstring in the function rather than typing it here.  

## Part E 

Write a function called `decision_tree` that implements your decision tree. It should accept as input single values of the relevant variables, and then return as output the guessed species of a penguin. Here's an example for the decision tree above: 

```python
def decision_tree(island, mass, sex):
    if island == "Torgersen":
        if mass > 4000:
            return "Adelie"
        else:
            return "Chinstrap"
    elif island == "Biscoe":
        if sex == "FEMALE":
            return "Gentoo"
        else:
            return "Chinstrap"
    else: 
        return "Adelie"
    
decision_tree("Biscoe", 5000, "MALE")
```
```
'Chinstrap'
```

Comments and docstrings are not necessary in this case, unless you skipped Part D. 

In [None]:
# your decision tree function here



## Part F

The following code will generate a guess for each penguin using the `decision_tree` function shown above. Modify the line that defines the `guesser` function according to the variables required by your decision tree. Then, run the code to create a new column called `Guess` containing the species guess for each penguin. 
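
To see how `apply` with `axis = 1` hands each row to a function, here is a small sketch on a made-up table. The names `toy`, `area_label`, and `labeler` are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Made-up table, unrelated to the penguins
toy = pd.DataFrame({"width": [1.0, 2.0, 3.0], "height": [4.0, 5.0, 6.0]})

def area_label(width, height):
    # Takes single values, just like decision_tree does
    return "big" if width * height > 8 else "small"

# With axis = 1, r is one row at a time; r["width"] is that row's width
labeler = lambda r: area_label(r["width"], r["height"])
toy["label"] = toy.apply(labeler, axis = 1)
print(toy)
```

The lambda is the only piece that needs to know the column names, which is why it is the line you modify for your own decision tree.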

In [None]:
# modify the first line, then run
guesser = lambda r: decision_tree(r["Island"], r["Body Mass (g)"], r["Sex"])
penguins["Guess"] = penguins.apply(guesser, axis = 1)

## Part G

Compute the accuracy of your decision tree: what percentage of the time does your decision tree give you the right answer? 

**Hint**: this is a one-liner. 

In [None]:
# your solution here


Soon, we'll learn how to use Python to automatically generate good decision trees without us needing to eyeball the data. 