# DSC 80: Lab 02

### Due Date: Tuesday January 21st, at 11:59 PM

## Instructions
Much like in DSC 10, this Jupyter Notebook contains the statements of the homework problems and provides code and markdown cells to display your answers to the problems. Unlike DSC 10, the notebook is *only* for displaying a readable version of your final answers. The coding work will be developed in an accompanying `lab02.py` file, that will be imported into the current notebook.

Homeworks and programming assignments will be graded in (at most) two ways:
1. The functions and classes in the accompanying python file will be tested (a la DSC 20),
2. The notebook will be graded (for graphs and free response questions).


**Do not change the function names in the `*.py` file**
- The functions in the `*.py` file are how your assignment is graded, and they are graded by their name. The dictionary at the end of the file (`GRADED FUNCTIONS`) contains the "grading list". The final function in the file allows your doctests to check that all the necessary functions exist.
- If you changed something you weren't supposed to, just use git to revert!

**Tips for working in the Notebook**:
- The notebooks serve to present you the questions and give you a place to present your results for later review.
- The notebook on *lab assignments* are not graded (only the `.py` file).
- Notebooks for PAs will serve as a final report for the assignment, and contain conclusions and answers to open ended questions that are graded.
- The notebook serves as a nice environment for 'pre-development' and experimentation before designing your function in your `.py` file.
- If autograder failed, check to make sure there's no syntax errors with the doctests!

**Tips for developing in the .py file**:
- Do not change the function names in the starter code; grading is done using these function names.
- Do not change the docstrings in the functions. These are there to tell you if your work is on the right track!
- You are encouraged to write your own additional functions to solve the lab! 
    - Developing in python usually consists of larger files, with many short functions.
    - You may write your other functions in an additional `.py` file that you import in `lab02.py` (much like we do in the notebook).
- Always document your code!

### Importing code from `lab**.py`

* We import our `.py` file that's contained in the same directory as this notebook.
* We use the `autoreload` notebook extension to make changes to our `lab**.py` file immediately available in our notebook. Without this extension, we would need to restart the notebook kernel to see any changes to `lab**.py` in the notebook.
    - `autoreload` is necessary because, upon import, `lab**.py` is compiled to bytecode (in the directory `__pycache__`). Subsequent imports of `lab**` merely import the existing compiled python.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import lab02 as lab

In [3]:
import os
import pandas as pd
import numpy as np

## Pandas Basics

---

**Question 1: Test scores**

You will be given a small dataset (so that you can manually check the correctness of your code). Please follow a few requirements when solving the problems below:

* For all questions you need to write code general enough to be applied to another similar dataset. 
* Do not hard-code any answers please. 
* Do not use `for` or `while` loops
   

1. Write a function called `data_load` that takes a file name of the data set to be read as a string and returns a dataframe following the steps below:

    a. Read only a subset of columns: `name`, `tries`, `highest_score`, `sex`
    
    b. Then you realized that for your analysis the column `sex` is not needed. Remove it. 
    
    c. You want to customize the column names: rename `name` to `firstname` and `tries` to `attempts`
    
    d. Turn the `firstname` column into the index.


2. Write a function `pass_fail` that takes the dataframe returned from the function above and adds a column `pass` based on the following conditions:

    * "Yes" if a number of attempts is strictly less than 3 and the score is >= 50
    * "Yes" if a number of attempts is strictly less than 6 and the score is >= 70
    * "Yes" if a number of attempts is strictly less than 10 and the score is >= 90
    * "No" otherwise
 
Your function should return the (modified) input dataframe with the added column.
    
3. Write a fuction `av_score` that takes in a dataframe from the question above and returns the average score for those students who passed the test. 
    
4. Write a function `highest_score_name` that takes in the dataframe from question 1.2 and returns a dictionary, where the key is the highest score and the value is the name (as a list) of the person with the highest score (attempts do not count). If more than one student got the highest score, include all names in a list. 

5. Write a function `idx_dup` that does not take any parameters and returns a single integer, answering the question below:

Is it possible for a dataframe's index to have duplicate values?
1. No, the index values must be unique and uses non-negative integers only, just like in numpy arrays
2. No, the index values must be unique and uses integers only
3. No, the index values must be unique but index values are not restricted to integers
4. Yes, but index values must be non-negative integers only
5. Yes, but index values must be integers only
6. Yes and index values are not restricted to integers
    


In [4]:
scores_fp = os.path.join('data', 'scores.csv')

In [14]:
def av_score(scores):
    modified = scores
    return np.mean(modified[modified['pass'] == 'Yes']['highest_score'].values)

In [16]:
scores_fp = os.path.join('data', 'scores.csv')
scores = data_load(scores_fp)
scores = pass_fail(scores)
av = av_score(scores)
isinstance(av, float)
91<av<92

True

In [17]:
def highest_score_name(scores):
    modified = scores
    hold = max(modified['highest_score'].values)
    content = modified[modified['highest_score'] == hold].index.tolist()
    return {hold:content}

In [19]:
scores_fp = os.path.join('data', 'scores.csv')
scores = data_load(scores_fp)
scores = pass_fail(scores)
highest = highest_score_name(scores)
isinstance(highest, dict)
len(next(iter(highest.items()))[1])

3

In [10]:
def pass_fail(scores):
    data = scores
    case_1 = (data['attempts'].values < 3) & (data['highest_score'].values >= 50)
    case_2 = (data['attempts'].values < 6) & (data['highest_score'].values >= 70)
    case_3 = (data['attempts'].values < 10) & (data['highest_score'].values >= 90)
    choice = case_1 | case_2 | case_3
    data['pass'] = choice
    def helper(case):
        if case:
            return 'Yes'
        else:
            return 'No'
    other = data['pass'].apply(helper).values
    data['pass'] = other
    return data

In [13]:
scores_fp = os.path.join('data', 'scores.csv')
scores = data_load(scores_fp)
scores = pass_fail(scores)
isinstance(scores, pd.DataFrame)
len(scores.columns)
scores.loc["Julia", "pass"]=='Yes'

True

In [6]:
def data_load(scores_fp):
    # a
    subset = pd.read_csv(scores_fp,usecols = ['name', 'tries', 'highest_score', 'sex'])

    # b
    remove_sex = subset.drop('sex',axis=1)

    # c
    customized_col = remove_sex.rename(columns = {'name':'firstname','tries':'attempts'})

    # d
    first_name = customized_col.set_index('firstname')

    return first_name

In [7]:
scores_fp = os.path.join('data', 'scores.csv')
scores = data_load(scores_fp)
isinstance(scores, pd.DataFrame)

True

In [8]:
list(scores.columns)

['attempts', 'highest_score']

In [9]:
isinstance(scores.index[0], int)

False

In [20]:
def idx_dup():
    return 6


## Tricky Pandas.

Sometimes you can get input that you do not expect. The next set of questions walk you through a few examples that might surprise you. 

---
**Question 2 : Duplicate and selection**



1. Write a function `trick_me` that does not take any parameters. <br>Inside the function: 
    * Create a dataframe `tricky_1` that has three columns labeled: "Name", "Name", "Age". Your table should have 5 rows, the values are up to you. 
    * Save this dataframe in the `csv` file called `tricky_1.csv` without the index. 
    * Now create another dataframe, `tricky_2`, by reading in the file `tricky_1.csv `. What are your observations?
        1. It was not possible to create a dataframe with the duplicate columns
        2. `tricky_1` and `tricky_2` have the same column names
        3. `tricky_1` and `tricky_2` have different column names
    * Return your answer as a letter
    
    

2. Write a function `reason_dup` that answers the following question: `Why does pandas allow us to have duplicate column names?` by returning a corresponding letter. 
    1. It does not, duplicate column names are not allowed
    2. Since duplicate indices are allowed and we also can transpose a dataframe.
    3. It is a bug in Pandas
    
    
   
   
3. Write a function `trick_bool` that does not take any parameters. To determine the correct answers from the list below, you should follow the steps outlined by experimenting in *the notebook* (or a python REPL). Outside the function:
    * Create a dataframe `bools` that has four columns labeled: "True", "True", "False", "False". Each column name is boolean.
    * Your table should have 4 rows, the values are up to you. 
    * You need to think (without running it) what output you should get when running each line of code below. Pick a corresponding answer from a given list. Your function should return a list with three letters that correspond to the dataframe structure for each line below. 
    
     ```
     df[True]
     df[[True, True, False, False]]
     df[[True, False]]
     ```
    
        1. Dataframe: 2 columns, 1 row
        2. Dataframe: 2 columns, 2 rows
        3. Dataframe: 2 columns, 3 rows
        4. Dataframe: 2 columns, 4 rows
        5. Dataframe: 3 columns, 1 rows
        6. Dataframe: 3 columns, 2 rows
        7. Dataframe: 3 columns, 3 rows
        8. Dataframe: 3 columns, 4 rows
        9. Dataframe: 4 columns, 1 rows
        10. Dataframe: 4 columns, 2 rows
        11. Dataframe: 4 columns, 3 rows
        12. Dataframe: 4 columns, 4 rows
        13. Error
    
    
4.  Write a function `reason_bool` that answers the following question: `Why the outputs are the way they are?` by returning a corresponding letter. 
    1. booleans arrays select either rows or columns, randomly
    2. booleans arrays always select rows by default
    3. booleans arrays always select columns by default 
    4. booleans arrays always select rows by default, unless column names are set to `True`/`False` values.
    
    
    
   


In [40]:
def trick_me():
    tricky_1 = pd.DataFrame([['Bonnie','',''],['Cathy','1','2'],['Norry','3','4'],['0','1','2'],['7','8','9']],columns = ["Name", "Name", "Age"] )
    tricky_1.to_csv('tricky_1.csv',index = False)
    tricky_2 = pd.read_csv('tricky_1.csv')
    return 'C'

In [41]:
ans =  trick_me()
ans

'C'

In [42]:
def reason_dup():
    return 'B'

In [43]:
def trick_bool():
    return ['D','J','M']

In [45]:
ans =  trick_bool()
isinstance(ans, list)
isinstance(ans[1], str)

True

In [46]:
def reason_bool():
    return 'D'

---
**Question 3 : np.NaN in a dataframe**


In the notebook, use the code given below to create a dataframe called `nans`. Note that we use `np.NaN` (`numpy`'s representation of 'Not a Number') to create missing values.
 
```
nans = pd.DataFrame([[0,1,np.NaN], [np.NaN, np.NaN, np.NaN], [1, 2, 3]])
```
Now you decided to make your dataset more readable for people who do not understand `NaN` and replace it with a `MISSING` string instead. In order to do that you wrote the following function:

```
def change(x):
    if x == np.NaN:
        return "MISSING"
    else:
        return x
```

* Write a line of code that applies the function above to the last column of the `nans` dataframe. 
* What was a result?
    * A: It worked: all np.NaNs in the last columns where changed to "MISSING"
    * B: It did not work: does not matter how I tried, the NaN values were not changed.
    
I expect you to answer `B` here. What had happened? Turns out, you can't use simple comparison `==` to detect if a value is `np.NaN`. You need to use another way to compare a variable to a `np.NaN`, read about it [here](https://stackoverflow.com/questions/41342609/the-difference-between-comparison-to-np-nan-and-isnull)

1. Modify the function `change` above to work as expected.
2. Write method `correct_replacement` that takes in a dataframe like `nans` and returns a modified dataframe, where all the `NaN` are replaced with `"MISSING"`. Use your corrected version of `change` to do this. **The pandas function .fillna is not allowed in this question.** 


In [47]:
def change(x):
    if pd.isnull(x):
        return "MISSING"
    else:
        return x

In [49]:
change(1.0) == 1.0
change(np.NaN) == 'MISSING'

True

In [54]:
nans = pd.DataFrame([[0,1,np.NaN], [np.NaN, np.NaN, np.NaN], [1, 2, 3]])
nans

Unnamed: 0,0,1,2
0,0.0,1.0,
1,,,
2,1.0,2.0,3.0


In [55]:
def correct_replacement(nans):
    def change(x):
        if pd.isnull(x):
            return "MISSING"
        else:
            return x
        
    return nans.applymap(change)

In [56]:
nans = pd.DataFrame([[0,1,np.NaN], [np.NaN, np.NaN, np.NaN], [1, 2, 3]])
A = correct_replacement(nans)
(A.values == 'MISSING').sum() == 4

True

---

### Summary Statistics

**Question 4**

In this question you will create two general purpose functions that make it easy to 'qualitatively' assess the contents of a dataframe.

1. Create a function `population_stats` which takes in a dataframe `df` and returns a dataframe indexed by the columns of `df`, with the following columns:
    * `num_nonnull` contains the number of non-null entries in each column,
    * `pct_nonnull` contains the proportion of entries in each column that are non-null,
    * `num_distinct` contains the number of distinct entries in each column,
    * `pct_distinct` contains the proportion of (non-null) entries in each column that are distinct from each other.
    
*Note*: you may find the `.nunique()` series method useful.

*Note*: The number of distinct entries does not include nulls.
    
2. Create a function `most_common` which takes in a dataframe `df` and a number `N` and returns a dataframe of the `N` most-common values (and their counts) for each column of `df`. Any column with fewer than `N` distinct values should contain `NaN` in those entries. 

*Note*: you can loop through the *columns* of `df` to construct your output. You should **not** be looping through rows.

For example, for the subset of the `salaries` dataframe with columns 'Job Title' and 'Status' from lecture one (left), `most_common(salaries, N=5)` is given (right). 

<table><tr>
    <td><img src="imgs/dataframe.png" width="70%"/></td>
    <td><img src="imgs/most_common.png" width="70%"/></td>
</tr></table>

In [59]:
def population_stats(df):
    index = df.columns.values
    num_nonnull =[]
    pct_nonnull= []
    num_distinct = []
    pct_distinct = []
    
    number = df.shape[0]
    
    def change(x):
        return not pd.isnull(x)
    def fill(x):
        if pd.isnull(x):
            return False
        else:
            return True
        
    for i in df.columns:
        col = df[i].apply(change).values
        count = np.count_nonzero(col)
        num_nonnull.append(count)
        pct_nonnull.append(count/number)
        num_distinct.append(df[i].nunique())
        pct_distinct.append(pd.Series(list(filter(fill,df[i].values.tolist()))).nunique()/number)
    return pd.DataFrame({'num_nonnull':num_nonnull,'pct_nonnull':pct_nonnull,
                         'num_distinct':num_distinct,'pct_distinct':pct_distinct},index = ['A', 'B', 'C', 'D'])

In [118]:
data = np.random.choice(range(10), size=(100, 4))
df = pd.DataFrame(data, columns='A B C D'.split())
out = population_stats(df)
out.index.tolist() == ['A', 'B', 'C', 'D']
out

Unnamed: 0,num_nonnull,pct_nonnull,num_distinct,pct_distinct
A,100,1.0,10,0.1
B,100,1.0,10,0.1
C,100,1.0,10,0.1
D,100,1.0,10,0.1


In [62]:
(out['num_distinct'] <= 10).all()
(out['pct_nonnull'] == 1.0).all()

True

In [64]:
def most_common(df, N=10):
    result = pd.DataFrame()
    for i in df.columns:
        con = df[i].value_counts()
        if N > len(con):
            col_1 = np.array(con.index.tolist() + [np.NaN] * (N-len(con)))
            col_2 = np.array(con.values.tolist() + [np.NaN] * (N-len(con)))
            result[i + '_values'] = col_1
            result[i + '_counts'] = col_2
        else:
            col_1 = np.array(con.index.tolist()[:N])
            col_2 =  np.array(con.values.tolist()[:N])
            result[i +'_values']= col_1
            result[i +'_counts']= col_2
    return result

In [68]:
data = np.random.choice(range(10), size=(100, 2))
df = pd.DataFrame(data, columns='A B'.split())
out = most_common(df, N=3)
out.index.tolist() == [0, 1, 2]
out.columns.tolist() == ['A_values', 'A_counts', 'B_values', 'B_counts']
out
out['A_values'].isin(range(10)).all()

True

## Faulty Scooters

**Question 5**

A new electric scooter company 'Maxwell Scooters' opened a retail shop in La Jolla recently and 300 UCSD students bought new scooters for getting around campus. After 8 students start complaining their scooters are faulty, negative on-line reviews for the scooters start to spread. In response, the scooter company adamantly claims that 99% of their scooters come off the production line working properly. You think this seems unlikely and decide to investigate.

* Select a significance level for you investigation. (Not to be turned in)
* What are reasonable choices for the *Null Hypothesis* for your investigation? Select all that apply:
    1. The scooter company produces scooters that are 99% non-faulty.
    2. The scooter company produces scooters that are less than 99% non-faulty.
    3. The scooter company produces scooters that are at least 1% faulty.
    4. The scooter company produces scooters that are ~2.6% faulty.

Return your answer in a function `null_hypoth` that takes zero arguments.

* Create a function `simulate_null` simulates a single step of data generation under the null hypothesis. The function should return a binary array.

* Create a function `estimate_p_val` that takes in a number `N` and returns the estimated p-value of your investigation upon simulating the null hypothesis `N` times.

*Note*: Plot the Null distribution and your observed statistic to check your work.

In [None]:
def null_hypoth():
    return [1]

In [70]:
def simulate_null():
    sig = 0.99
    stats = np.random.binomial(300, sig)
    answer = np.array(stats *[1] + (300 - stats)*[0])
    return answer

In [71]:
pd.Series(simulate_null()).isin([0,1]).all()

True

In [81]:
def estimate_p_val(N):
    lst = []
    for i in range(N):
        lst.append(np.count_nonzero(simulate_null()))
    rest = np.array(lst) < 292 #after 8 students
    count = np.count_nonzero(rest)/len(rest)
    return count

In [83]:
0 < estimate_p_val(1000) < 0.1
estimate_p_val(1000)

[autoreload of lab02 failed: Traceback (most recent call last):
  File "//anaconda3/lib/python3.7/site-packages/IPython/extensions/autoreload.py", line 245, in check
    superreload(m, reload, self.old_objects)
  File "//anaconda3/lib/python3.7/site-packages/IPython/extensions/autoreload.py", line 434, in superreload
    module = reload(module)
  File "//anaconda3/lib/python3.7/imp.py", line 314, in reload
    return importlib.reload(module)
  File "//anaconda3/lib/python3.7/importlib/__init__.py", line 169, in reload
    _bootstrap._exec(spec, module)
  File "<frozen importlib._bootstrap>", line 630, in _exec
  File "<frozen importlib._bootstrap_external>", line 724, in exec_module
  File "<frozen importlib._bootstrap_external>", line 860, in get_code
  File "<frozen importlib._bootstrap_external>", line 791, in source_to_code
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/Users/bonnie117/Documents/GitHub/dsc80-wi20/labs/lab02/lab02.py", line 1

0.003

# Super-Heroes

The questions below analyze a dataset of super-heroes found in the `data` directory. One of the datasets have a list of attributes on each super-hero, while the other is a *boolean* dataframe of which super-heroes have which super-powers. Note, the datasets contain information on both *good* super-heroes, as well as *bad* super-heroes (AKA villains). 

### Super-hero powers

**Question 6**

Now read in the dataset of super-hero powers in the `data` directory. Create a function `super_hero_powers` that takes in a dataframe like `powers` and returns a list with the following three entries:

1. The name of the super-hero with the greatest number of powers.
2. The name of the most common super-power among super-heroes whose names begin with 'M'.
3. The most popular super-power among those with only one super-power.

You should *not* be hard-coding your answers in this question; your function should work on any dataset similar to `powers`. You should not be using loops in this question.

*Note:* You may find the `.idxmax` method useful in this problem.

In [85]:
powers_fp = os.path.join('data', 'superheroes_powers.csv')
powers = pd.read_csv(powers_fp)

In [87]:
def super_hero_powers(powers):
    df = powers.drop('hero_names',axis=1).applymap(np.count_nonzero)
    count = df.sum(axis=1).values
    greatest = max(count)
    powers['count'] = count
    first_col = powers[powers['count'] == greatest]['hero_names'].values[0]

    most_M = powers[powers['hero_names'].str.match('M')]
    M_noHN = most_M.drop('hero_names',axis = 1).applymap(np.count_nonzero)
    modi_M = M_noHN.sum(axis = 0)
    to_find = sorted(modi_M)[-2]
    
    second_col = modi_M[modi_M == to_find].index[0]

    super_power1 = powers[powers['count'] == 1]
    super_power2 = super_power1.drop('hero_names',axis=1).applymap(np.count_nonzero)
    super_power3 = super_power2.sum(axis=0)
    third_col = super_power3[super_power3 == sorted(super_power3)[-2]].index[0]

    return [first_col,second_col,third_col]

In [89]:
fp = os.path.join('data', 'superheroes_powers.csv')
powers = pd.read_csv(fp)
out = super_hero_powers(powers)
isinstance(out, list)
out

['Spectre', 'Super Strength', 'Intelligence']

### Super-hero attributes

Read in the dataset of super-hero attributes from the `data` directory. Use your summary functions from question 4 to help acquaint yourself with the dataset.

**Question 7**

Cleaning the data: the dataset has no explicit null (`np.NaN`) values, although many entries *should* be null. Replace these values with null by creating a function `clean_heroes`.

Now answer the following questions, collecting your answers in a (function `super_hero_stats` that returns) a list. You should answer the questions using the *cleaned* super-heroes data; your answers *should* be hard-coded in the function.
1. Which publisher has a greater proportion of 'bad' characters -- Marvel Comics or DC Comics?
2. Give the number of characters that are NOT human, or the publisher is not Marvel Comics nor DC comics. For this question, only consider race "Human" as human, races such as "Human / Radiation" don't count as human.
3. Give the name of the character that's both greater than one standard deviation above mean in height and at least one standard deviation below the mean in weight.
4. Who is heavier on average: good or bad characters?
5. What is the name of the tallest Mutant with no hair?
6. What is the probability that a randomly chosen 'Marvel' character in the dataset is a woman?

*Note:* Since your answers to these questions should be hard-coded, you should not include your code in your .py file. Just return a list with your answers.

*Note:* Nan denotes an unknown value that does not count as an entry with any attributes.

In [124]:
herosss = clean_heroes(heroes)
marvel = (herosss['Publisher'].values=='Marvel Comics')
dc = (herosss['Publisher'].values=='DC Comics')
bad = (herosss['Alignment'].values == 'bad')
marvel_bad = pd.Series(marvel & bad).sum()
dc_bad = pd.Series(dc & bad).sum()


59

In [119]:
def clean_heroes(heroes):
    checker = lambda x:np.NaN if x=='-' or x == -99 else x
    return heroes.applymap(checker)
superheroes_fp = os.path.join('data', 'superheroes.csv')
heroes = pd.read_csv(superheroes_fp, index_col=0)
out = clean_heroes(heroes)
out['Skin color'].isnull().any()
out

Unnamed: 0,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
0,A-Bomb,Male,yellow,Human,No Hair,203.0,Marvel Comics,,good,441.0
1,Abe Sapien,Male,blue,Icthyo Sapien,No Hair,191.0,Dark Horse Comics,blue,good,65.0
2,Abin Sur,Male,blue,Ungaran,No Hair,185.0,DC Comics,red,good,90.0
3,Abomination,Male,green,Human / Radiation,No Hair,203.0,Marvel Comics,,bad,441.0
4,Abraxas,Male,blue,Cosmic Entity,Black,,Marvel Comics,,bad,
5,Absorbing Man,Male,blue,Human,No Hair,193.0,Marvel Comics,,bad,122.0
6,Adam Monroe,Male,blue,,Blond,,NBC - Heroes,,good,
7,Adam Strange,Male,blue,Human,Blond,185.0,DC Comics,,good,88.0
8,Agent 13,Female,blue,,Blond,173.0,Marvel Comics,,good,61.0
9,Agent Bob,Male,brown,Human,Brown,178.0,Marvel Comics,,good,81.0


True

In [112]:
(out['Publisher']=='DC Comics').sum()


215

In [None]:
def super_hero_stats():
    return ['Marvel Comics',558,'Groot','bad','Onslaugh',0.30578512396694213]

### Are blond-haired, blue-eyed characters disproportionately 'good'?

**Question 8**

1. Create a function `bhbe` ('blond-hair-blue-eyes') that returns a boolean column that labels super-heroes/villains that are blond-haired *and* blue eyed.
    * Look at the values of the hair/eyes columns; it needs some cleaning! (The doctest makes sure you've cleaned it properly. If you don't pass the doctest, look more closely at the values in the columns!)


Now, you'd like to answer the question 
> "Are blond-haired, blue-eyed characters disproportionately 'good'?"

To do this, you'd like to test the null hypothesis:
> "The proportion of 'good' heroes among blond-haired, blue-eyed heroes is roughly the same as (equals) the proportion of 'good' heroes in the overall population."

Fix a significance level of 1%.

2. Create a function `observed_stat` that takes in `heroes`, and returns the observed test statistic.
3. Create a function `simulate_bhbe_null` that takes in a number `n` that returns a `n` instances of the test statistic generated under the null hypothesis. You should hard-code your simulation parameter into the function (rounding to the nearest hundredth is fine); the function should *not* read in any data.
4. Create a function `calc_pval` that returns a list where:
    * the first element is the p-value for hypothesis test (using 100,000 simulations). Please run the code yourself and hard-code this answer, as actually running the 100,000 simulation hypothesis test will timeout on gradescope. 
    * the second element is `Reject` if you reject the null hypothesis and `Fail to reject` if you fail to reject the null hypothesis.

In [113]:
def bhbe_col(heroes):
    cleaned = clean_heroes(heroes)
    blond_hair = (cleaned['Hair color'].values=='Blond') | (cleaned['Hair color'].values=='blond')|(cleaned['Hair color'].values=='Strawberry Blond')
    blue_eye = cleaned['Eye color'].values=='blue'
    return pd.Series(blond_hair & blue_eye)


In [126]:
superheroes_fp = os.path.join('data', 'superheroes.csv')
heroes = pd.read_csv(superheroes_fp, index_col=0)
out = bhbe_col(heroes)
isinstance(out, pd.Series)
out.dtype == np.dtype('bool')
out.sum()

93

In [127]:
def observed_stat(heroes):
    heroes['bhbe'] = bhbe_col(heroes)
    observed = heroes[heroes['bhbe'] == True]
    observed_prop = observed[observed['Alignment'] == 'good'].shape[0]/observed.shape[0]
    return observed_prop

In [131]:
superheroes_fp = os.path.join('data', 'superheroes.csv')
heroes = pd.read_csv(superheroes_fp, index_col=0)
out = observed_stat(heroes)
0.5 <= out <= 1.0
out

0.8494623655913979

In [136]:
def simulate_bhbe_null(n):

    result = []
    p = 0.8494623655913979
    for i in range(n):
        stat = np.random.binomial(93,p)/93
        result.append(stat)
    return pd.Series(result)

In [138]:
superheroes_fp = os.path.join('data', 'superheroes.csv')
heroes = pd.read_csv(superheroes_fp, index_col=0)
out = simulate_bhbe_null(10)
isinstance(out, pd.Series)

True

In [139]:
out1 = simulate_bhbe_null(100000)
out1

0        0.892473
1        0.795699
2        0.806452
3        0.849462
4        0.892473
5        0.849462
6        0.860215
7        0.827957
8        0.881720
9        0.806452
10       0.817204
11       0.827957
12       0.838710
13       0.860215
14       0.860215
15       0.870968
16       0.881720
17       0.849462
18       0.795699
19       0.935484
20       0.795699
21       0.806452
22       0.838710
23       0.838710
24       0.806452
25       0.763441
26       0.870968
27       0.838710
28       0.913978
29       0.795699
           ...   
99970    0.903226
99971    0.892473
99972    0.903226
99973    0.860215
99974    0.827957
99975    0.892473
99976    0.774194
99977    0.817204
99978    0.849462
99979    0.903226
99980    0.817204
99981    0.860215
99982    0.838710
99983    0.838710
99984    0.903226
99985    0.838710
99986    0.870968
99987    0.827957
99988    0.838710
99989    0.892473
99990    0.784946
99991    0.881720
99992    0.817204
99993    0.892473
99994    0

In [None]:
p_v = estimate_p_val(1000000)
p_v