# DSC 80: Lab 02

### Due Date: Tuesday April 13th, at 11:59 PM

## Instructions
Much like in DSC 10, this Jupyter Notebook contains the statements of the homework problems and provides code and markdown cells to display your answers to the problems. Unlike DSC 10, the notebook is *only* for displaying a readable version of your final answers. The coding work will be developed in an accompanying `lab02.py` file, which will be imported into the current notebook.

Homeworks and programming assignments will be graded in (at most) two ways:
1. The functions and classes in the accompanying python file will be tested (a la DSC 20),
2. The notebook will be graded (for graphs and free response questions).


**Do not change the function names in the `*.py` file**
- The functions in the `*.py` file are how your assignment is graded, and they are graded by their name. The dictionary at the end of the file (`GRADED FUNCTIONS`) contains the "grading list". The final function in the file allows your doctests to check that all the necessary functions exist.
- If you changed something you weren't supposed to, just use git to revert!

**Tips for working in the Notebook**:
- The notebooks serve to present the questions to you and give you a place to present your results for later review.
- The notebooks for *lab assignments* are not graded (only the `.py` file is).
- Notebooks for PAs will serve as a final report for the assignment, and contain conclusions and answers to open ended questions that are graded.
- The notebook serves as a nice environment for 'pre-development' and experimentation before designing your function in your `.py` file.
- If the autograder fails, check that there are no syntax errors in the doctests!

**Tips for developing in the .py file**:
- Do not change the function names in the starter code; grading is done using these function names.
- Do not change the docstrings in the functions. These are there to tell you if your work is on the right track!
- You are encouraged to write your own additional functions to solve the lab! 
    - Developing in python usually consists of larger files, with many short functions.
    - You may write your other functions in an additional `.py` file that you import in `lab02.py` (much like we do in the notebook).
- Always document your code!

### Importing code from `lab**.py`

* We import our `.py` file that's contained in the same directory as this notebook.
* We use the `autoreload` notebook extension to make changes to our `lab**.py` file immediately available in our notebook. Without this extension, we would need to restart the notebook kernel to see any changes to `lab**.py` in the notebook.
    - `autoreload` is necessary because, upon import, `lab**.py` is compiled to bytecode (in the directory `__pycache__`). Subsequent imports of `lab**` merely import the existing compiled python.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import lab02 as lab

In [3]:
import os
import pandas as pd
import numpy as np

## Pandas Basics

---

**Question 1: Test scores**

You will be given a small dataset (so that you can manually check the correctness of your code). Please follow a few requirements when solving the problems below:

* For all questions you need to write code general enough to be applied to another similar dataset. 
* Do not hard-code any answers please. 
* Do not use `for` or `while` loops
   

1. Write a function called `data_load` that takes a file name of the data set to be read as a string and returns a dataframe following the steps below:

    a. Read only a subset of columns: `name`, `tries`, `highest_score`, `sex`
    
    b. Then you realized that for your analysis the column `sex` is not needed. Remove it. 
    
    c. You want to customize the column names: rename `name` to `firstname` and `tries` to `attempts`
    
    d. Turn the `firstname` column into the index.


2. Write a function `pass_fail` that takes the dataframe returned from the function above and adds a column `pass` based on the following conditions:

    * "Yes" if the number of attempts is strictly less than 3 and the score is >= 50
    * "Yes" if the number of attempts is strictly less than 6 and the score is >= 70
    * "Yes" if the number of attempts is strictly less than 10 and the score is >= 90
    * "No" otherwise
 
Your function should return the (modified) input dataframe with the added column.
    
3. Write a function `av_score` that takes in a dataframe from the question above and returns the average score for those students who passed the test. 
    
4. Write a function `highest_score_name` that takes in the dataframe from question 1.2 and returns a dictionary, where the key is the highest score and the value is the name (as a list) of the person with the highest score (attempts do not count). If more than one student got the highest score, include all names in a list. 

5. Write a function `idx_dup` that does not take any parameters and returns a single integer, answering the question below:

Is it possible for a dataframe's index to have duplicate values?
1. No, the index values must be unique and uses non-negative integers only, just like in numpy arrays
2. No, the index values must be unique and uses integers only
3. No, the index values must be unique but index values are not restricted to integers
4. Yes, but index values must be non-negative integers only
5. Yes, but index values must be integers only
6. Yes and index values are not restricted to integers
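
A quick experiment in the notebook can settle this question. The sketch below uses made-up names and scores (not the lab data) to build a frame whose string index repeats a label, then checks how pandas treats it.

```python
import pandas as pd

# Hypothetical toy data: the label 'Ana' appears twice in the index.
toy = pd.DataFrame({'score': [90.0, 75.5, 60.0]}, index=['Ana', 'Ben', 'Ana'])

# pandas accepts the repeated label; is_unique reports whether any label repeats.
index_is_unique = toy.index.is_unique

# Selecting a repeated label returns every matching row.
ana_rows = toy.loc['Ana']
```

Here `ana_rows` is a two-row dataframe, which is worth keeping in mind whenever `.loc` is expected to return a single row.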
    


In [4]:
scores_fp = os.path.join('data', 'scores.csv')

In [5]:
def data_load(scores_fp):
    """
    follows different steps to create a dataframe
    :param scores_fp: file name as a string
    :return: a dataframe
    >>> scores_fp = os.path.join('data', 'scores.csv')
    >>> scores = data_load(scores_fp)
    >>> isinstance(scores, pd.DataFrame)
    True
    >>> list(scores.columns)
    ['attempts', 'highest_score']
    >>> isinstance(scores.index[0], int)
    False
    """
    # a
    df = pd.read_csv(scores_fp, header=0, usecols=['name','tries','highest_score','sex'])

    # b
    df = df.drop(['sex'],axis=1)

    # c
    df = df.rename(columns={'name':'firstname', 'tries':'attempts'})

    # d
    df = df.set_index('firstname')

    return df

In [6]:
scores_fp = os.path.join('data', 'scores.csv')
scores = data_load(scores_fp)
scores

Unnamed: 0_level_0,attempts,highest_score
firstname,Unnamed: 1_level_1,Unnamed: 2_level_1
Julia,4,90.0
Angelica,2,70.0
Tyler,2,88.0
Kathleen,7,88.5
Axel,5,45.3
Amiya,2,34.0
Marina,2,100.0
Torrey,14,99.0
Mariah,10,98.1
Grayson,3,67.0


In [7]:
isinstance(scores, pd.DataFrame)

True

In [8]:
list(scores.columns)

['attempts', 'highest_score']

In [9]:
isinstance(scores.index[0], int)

False

In [10]:
def pass_fail_helper(row):
    if row['attempts'] < 3 and row['highest_score'] >= 50:
        return 'Yes'
    elif row['attempts'] < 6 and row['highest_score'] >= 70:
        return 'Yes'
    elif row['attempts'] < 10 and row['highest_score'] >= 90:
        return 'Yes'
    else:
        return 'No'

In [11]:
def pass_fail(scores):
    """
    modifies the scores dataframe by adding one more column satisfying
    conditions from the write up.
    :param scores: dataframe from the question above
    :return: dataframe with additional column pass
    >>> scores_fp = os.path.join('data', 'scores.csv')
    >>> scores = data_load(scores_fp)
    >>> scores = pass_fail(scores)
    >>> isinstance(scores, pd.DataFrame)
    True
    >>> len(scores.columns)
    3
    >>> scores.loc["Julia", "pass"]=='Yes'
    True

    """

    scores['pass'] = scores.apply(lambda row: pass_fail_helper(row), axis=1)
    #scores = scores.insert(loc=2,column='pass',value=newcol)
    return scores

In [12]:
scores_fp = os.path.join('data', 'scores.csv')
scores = data_load(scores_fp)
scores = pass_fail(scores)
scores

Unnamed: 0_level_0,attempts,highest_score,pass
firstname,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Julia,4,90.0,Yes
Angelica,2,70.0,Yes
Tyler,2,88.0,Yes
Kathleen,7,88.5,No
Axel,5,45.3,No
Amiya,2,34.0,No
Marina,2,100.0,Yes
Torrey,14,99.0,No
Mariah,10,98.1,No
Grayson,3,67.0,No
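
The row-wise `apply` above works, but the same rule can also be written with boolean masks over whole columns. The sketch below is one possible alternative, not the graded solution; `pass_fail_vectorized` is a hypothetical name, and unlike `pass_fail` it returns a copy instead of modifying its input.

```python
import numpy as np
import pandas as pd

def pass_fail_vectorized(scores):
    """Hypothetical alternative to pass_fail built from boolean masks."""
    attempts = scores['attempts']
    score = scores['highest_score']
    # One mask per rule from the write-up, combined with "or".
    passed = (
        ((attempts < 3) & (score >= 50))
        | ((attempts < 6) & (score >= 70))
        | ((attempts < 10) & (score >= 90))
    )
    out = scores.copy()  # leave the caller's frame untouched
    out['pass'] = np.where(passed, 'Yes', 'No')
    return out

# A few rows copied from the table above.
toy = pd.DataFrame(
    {'attempts': [4, 2, 7, 14], 'highest_score': [90.0, 70.0, 88.5, 99.0]},
    index=pd.Index(['Julia', 'Angelica', 'Kathleen', 'Torrey'], name='firstname'),
)
result = pass_fail_vectorized(toy)
```

On these four rows the `pass` column matches the values shown in the table above.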


In [13]:
isinstance(scores, pd.DataFrame)

True

In [14]:
type(scores)

pandas.core.frame.DataFrame

In [15]:
len(scores.columns)

3

In [16]:
scores.loc["Julia", "pass"]=='Yes'

True

In [17]:
def av_score(scores):
    """
    returns the average score for those students who passed the test.
    :param scores: dataframe from the second question
    :return: average score
    >>> scores_fp = os.path.join('data', 'scores.csv')
    >>> scores = data_load(scores_fp)
    >>> scores = pass_fail(scores)
    >>> av = av_score(scores)
    >>> isinstance(av, float)
    True
    >>> 91 < av < 92
    True
    """

    return scores[scores['pass'] == 'Yes']['highest_score'].mean()

In [18]:
scores_fp = os.path.join('data', 'scores.csv')
scores = data_load(scores_fp)
scores = pass_fail(scores)
av = av_score(scores)
av

91.33333333333333

In [19]:
isinstance(av, float)

True

In [20]:
91 < av < 92

True

In [21]:
def highest_score_name(scores):
    """
    finds the highest score and people who received it
    :param scores: dataframe from the second question
    :return: dictionary where the key is the highest score and the value(s) is a list of name(s)
    >>> scores_fp = os.path.join('data', 'scores.csv')
    >>> scores = data_load(scores_fp)
    >>> scores = pass_fail(scores)
    >>> highest = highest_score_name(scores)
    >>> isinstance(highest, dict)
    True
    >>> len(next(iter(highest.items()))[1])
    3
    """
    scores = scores.reset_index()
    maxscore = scores['highest_score'].max()
    top_scorers = scores[scores['highest_score'] == maxscore]
    #return top_scorers#.to_dict()

    names = list(top_scorers.firstname)
    return {maxscore:names}

In [22]:
scores_fp = os.path.join('data', 'scores.csv')
scores = data_load(scores_fp)
scores = pass_fail(scores)
highest = highest_score_name(scores)
highest

{100.0: ['Marina', 'Marina', 'Marina']}

In [23]:
isinstance(highest, dict)

True

In [24]:
len(next(iter(highest.items()))[1])

3
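
To see how ties are handled, it can help to try the same idea on a frame small enough to check by hand. This is a sketch on made-up data (the name `Noor` is hypothetical); it reads the names straight from the index instead of calling `reset_index`.

```python
import pandas as pd

# Hypothetical toy scores with two students tied for the top score.
toy = pd.DataFrame(
    {'highest_score': [88.0, 100.0, 100.0]},
    index=pd.Index(['Tyler', 'Marina', 'Noor'], name='firstname'),
)

top = toy['highest_score'].max()
# The boolean mask keeps every row tied for the maximum; the names are the index labels.
names = toy.index[toy['highest_score'] == top].tolist()
highest = {top: names}
```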

In [25]:
def idx_dup():
    """
    Answers the question in the write up.
    :return:
    >>> ans = idx_dup()
    >>> isinstance(ans, int)
    True
    >>> 1 <= ans <= 6
    True
    """
    
    return 6

In [26]:
ans = idx_dup()
ans

6

In [27]:
isinstance(ans, int)

True

In [28]:
1 <= ans <= 6

True

## Tricky Pandas.

Sometimes you can get input that you do not expect. The next set of questions walks you through a few examples that might surprise you. 

---
**Question 2 : Duplicate and selection**



1. Write a function `trick_me` that does not take any parameters. <br>Inside the function: 
    * Create a dataframe `tricky_1` that has three columns labeled: "Name", "Name", "Age". Your table should have 5 rows, the values are up to you. 
    * Save this dataframe in the `csv` file called `tricky_1.csv` without the index. 
    * Now create another dataframe, `tricky_2`, by reading in the file `tricky_1.csv`. What are your observations?
        1. It was not possible to create a dataframe with the duplicate columns
        2. `tricky_1` and `tricky_2` have the same column names
        3. `tricky_1` and `tricky_2` have different column names
    * Return your answer as a letter
    
    

2. Write a function `reason_dup` that answers the following question: `Why does pandas allow us to have duplicate column names?` by returning a corresponding letter. 
    1. It does not, duplicate column names are not allowed
    2. Since duplicate indices are allowed and we also can transpose a dataframe.
    3. It is a bug in Pandas
    
    
   
   
3. Write a function `trick_bool` that does not take any parameters. To determine the correct answers from the list below, you should follow the steps outlined by experimenting in *the notebook* (or a python REPL). Outside the function:
    * Create a dataframe `bools` that has four columns labeled: "True", "True", "False", "False". Each column name is boolean.
    * Your table should have 4 rows, the values are up to you. 
    * You need to think (without running it) what output you should get when running each line of code below. Pick a corresponding answer from a given list. Your function should return a list with three letters that correspond to the dataframe structure for each line below. 
    
     ```
     df[True]
     df[[True, True, False, False]]
     df[[True, False]]
     ```
    
        1. Dataframe: 2 columns, 1 row
        2. Dataframe: 2 columns, 2 rows
        3. Dataframe: 2 columns, 3 rows
        4. Dataframe: 2 columns, 4 rows
        5. Dataframe: 3 columns, 1 rows
        6. Dataframe: 3 columns, 2 rows
        7. Dataframe: 3 columns, 3 rows
        8. Dataframe: 3 columns, 4 rows
        9. Dataframe: 4 columns, 1 rows
        10. Dataframe: 4 columns, 2 rows
        11. Dataframe: 4 columns, 3 rows
        12. Dataframe: 4 columns, 4 rows
        13. Error
    
    
4.  Write a function `reason_bool` that answers the following question: `Why are the outputs the way they are?` by returning a corresponding letter. 
    1. boolean arrays select either rows or columns, randomly
    2. boolean arrays always select rows by default
    3. boolean arrays always select columns by default 
    4. boolean arrays always select rows by default, unless column names are set to `True`/`False` values.
    
    
    
   


In [29]:
pd.DataFrame({'Name':['Z','Y','X','W','V'], 'Name':['A','B','C','D','E'] , 'Age':[1,2,3,4,5]})

Unnamed: 0,Name,Age
0,A,1
1,B,2
2,C,3
3,D,4
4,E,5


In [30]:
data = [
    ['a', 'z', 1], # row 1
    ['b', 'y', 2], # row 2
    ['c', 'x', 3], # row 3
    ['d', 'w', 4],  # row 4
    ['e', 'v', 5]  # row 5
]
df = pd.DataFrame(data,                                     # rows of dataframe
                   columns = ['name', 'name', 'age'])       # column names 
df

Unnamed: 0,name,name.1,age
0,a,z,1
1,b,y,2
2,c,x,3
3,d,w,4
4,e,v,5


In [31]:
csv_fp = os.path.join('data', 'tricky1.csv')
df.to_csv(csv_fp)
df2 = pd.read_csv(csv_fp)
df2

Unnamed: 0.1,Unnamed: 0,name,name.1,age
0,0,a,z,1
1,1,b,y,2
2,2,c,x,3
3,3,d,w,4
4,4,e,v,5
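
The cell above writes the index along with the data, which is why an extra `Unnamed: 0` column shows up after reading the file back. The sketch below repeats the round trip without the index, as the question asks. It writes to an in-memory buffer instead of `tricky_1.csv` so that it leaves no file behind; the two-row frame is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical toy frame with a repeated column label.
tricky_1 = pd.DataFrame(
    [['a', 'z', 1], ['b', 'y', 2]],
    columns=['Name', 'Name', 'Age'],
)

# Write to an in-memory buffer; index=False leaves the row labels out of the CSV.
buffer = io.StringIO()
tricky_1.to_csv(buffer, index=False)
buffer.seek(0)

# read_csv renames repeated headers by default.
tricky_2 = pd.read_csv(buffer)

cols_before = list(tricky_1.columns)
cols_after = list(tricky_2.columns)
```

Comparing `cols_before` with `cols_after` shows that the second `Name` column comes back as `Name.1`.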


In [32]:
def trick_me():
    """
    Answers the question in the write-up
    :return: a letter
    >>> ans =  trick_me()
    >>> ans == 'A' or ans == 'B' or ans == "C"
    True
    """
    return 'C'

In [33]:
def reason_dup():
    """
     Answers the question in the write-up
    :return: a letter
    >>> ans =  reason_dup()
    >>> ans == 'A' or ans == 'B' or ans == "C"
    True
    """
    return 'B'
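
The reasoning behind this answer can be checked directly: transposing a frame swaps its axes, so a repeated index label turns into a repeated column label. A minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical toy frame whose index repeats the label 'x'.
tall = pd.DataFrame({'value': [1, 2, 3]}, index=['x', 'x', 'y'])

# After the transpose, the old index labels are the column labels.
wide = tall.T
wide_columns = list(wide.columns)
```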

In [34]:
data = [['a', 'z', 1, 10],['b', 'y', 2, 9],['c', 'x', 3, 8],['d', 'w', 4, 7]]
df = pd.DataFrame(data,columns = [True, True, False, False])
df

Unnamed: 0,True,True.1,False,False.1
0,a,z,1,10
1,b,y,2,9
2,c,x,3,8
3,d,w,4,7


In [35]:
df[True]

Unnamed: 0,True,True.1
0,a,z
1,b,y
2,c,x
3,d,w


In [36]:
df[[True, True, False, False]]

Unnamed: 0,True,True.1,False,False.1
0,a,z,1,10
1,b,y,2,9


In [37]:
# df[[True, False]]
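
The last expression is commented out above because it does not return a dataframe. The sketch below rebuilds the same kind of frame under the name `bools` and catches the exception so that the cell still runs; it assumes a reasonably recent pandas.

```python
import pandas as pd

data = [['a', 'z', 1, 10], ['b', 'y', 2, 9], ['c', 'x', 3, 8], ['d', 'w', 4, 7]]
bools = pd.DataFrame(data, columns=[True, True, False, False])

# A list of booleans as long as the index is read as a row mask.
first_two_rows = bools[[True, True, False, False]]

# A boolean list of any other length cannot be lined up with the rows.
try:
    bools[[True, False]]
    mismatch_error = None
except ValueError as err:
    mismatch_error = err
```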

In [38]:
def trick_bool():
    """
     Answers the question in the write-up
    :return: a list with three letters
    >>> ans =  trick_bool()
    >>> isinstance(ans, list)
    True
    >>> isinstance(ans[1], str)
    True
    """
    return ['D','J','M']

In [39]:
ans =  trick_bool()

In [40]:
isinstance(ans[1], str)

True

In [41]:
def reason_bool():
    """
    Answers the question in the write-up
    :return: a letter
    >>> ans =  reason_bool()
    >>> ans == 'A' or ans == 'B' or ans == "C" or ans =="D"
    True

    """
    return 'D'

In [42]:
ans =  reason_bool()
ans == 'A' or ans == 'B' or ans == "C" or ans =="D"

True

---
**Question 3 : np.NaN in a dataframe**


In the notebook, use the code given below to create a dataframe called `nans`. Note that we use `np.NaN` (`numpy`'s representation of 'Not a Number') to create missing values.
 
```
nans = pd.DataFrame([[0,1,np.NaN], [np.NaN, np.NaN, np.NaN], [1, 2, 3]])
```
Now you decided to make your dataset more readable for people who do not understand `NaN` and replace it with a `MISSING` string instead. In order to do that you wrote the following function:

```
def change(x):
    if x == np.NaN:
        return "MISSING"
    else:
        return x
```

* Write a line of code that applies the function above to the last column of the `nans` dataframe. 
* What was the result?
    * A: It worked: all np.NaNs in the last column were changed to "MISSING"
    * B: It did not work: no matter how I tried, the NaN values were not changed.
    
I expect you to answer `B` here. What happened? It turns out that you can't use a simple `==` comparison to detect whether a value is `np.NaN`. You need another way to compare a variable to `np.NaN`; read about it [here](https://stackoverflow.com/questions/41342609/the-difference-between-comparison-to-np-nan-and-isnull)
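
The behavior is easy to confirm in a scratch cell. The sketch below uses `np.nan`, which is the same value the lab calls `np.NaN` (that alias was removed in NumPy 2.0), and compares the equality test with two null checks.

```python
import numpy as np
import pandas as pd

nan = np.nan

equal_to_itself = (nan == nan)        # NaN never compares equal, even to itself
detected_by_numpy = np.isnan(nan)     # element-wise NaN test from numpy
detected_by_pandas = pd.isnull(nan)   # null test from pandas

# pd.isnull also accepts non-numeric values, where np.isnan raises a TypeError.
string_is_null = pd.isnull('MISSING')
```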

1. Modify the function `change` above to work as expected.
2. Write a function `correct_replacement` that takes in a dataframe like `nans` and returns a modified dataframe, where all the `NaN` are replaced with `"MISSING"`. Use your corrected version of `change` to do this. **The pandas function .fillna is not allowed in this question.** 


In [43]:
nans = pd.DataFrame([[0,1,np.NaN], [np.NaN, np.NaN, np.NaN], [1, 2, 3]])

In [44]:
def change(x):
    """
    Returns 'MISSING' when x is `NaN`,
    Otherwise returns x
    >>> change(1.0) == 1.0
    True
    >>> change(np.NaN) == 'MISSING'
    True
    """
    # pd.isnull also handles non-numeric values such as strings
    if pd.isnull(x):
        return "MISSING"
    else:
        return x

In [45]:
change(1.0) == 1.0

True

In [46]:
change(np.NaN) == 'MISSING'

True

In [47]:
nans[2].apply(change)

0    MISSING
1    MISSING
2        3.0
Name: 2, dtype: object

In [48]:
def correct_replacement(nans):
    """
    changes all np.NaNs to "Missing"
    :param nans: given dataframe
    :return: modified dataframe
    >>> nans = pd.DataFrame([[0,1,np.NaN], [np.NaN, np.NaN, np.NaN], [1, 2, 3]])
    >>> A = correct_replacement(nans)
    >>> (A.values == 'MISSING').sum() == 4
    True
    """
    # apply `change` to every entry, column by column, without mutating the input
    return nans.apply(lambda col: col.apply(change))

In [49]:
nans = pd.DataFrame([[0,1,np.NaN], [np.NaN, np.NaN, np.NaN], [1, 2, 3]])
nans

Unnamed: 0,0,1,2
0,0.0,1.0,
1,,,
2,1.0,2.0,3.0


In [50]:
A = correct_replacement(nans)
A

Unnamed: 0,0,1,2
0,0.0,1.0,MISSING
1,MISSING,MISSING,MISSING
2,1.0,2.0,3.0


In [51]:
(A.values == 'MISSING').sum() == 4

True

---

### Summary Statistics

**Question 4**

In this question you will create two general purpose functions that make it easy to 'qualitatively' assess the contents of a dataframe.

1. Create a function `population_stats` which takes in a dataframe `df` and returns a dataframe indexed by the columns of `df`, with the following columns:
    * `num_nonnull` contains the number of non-null entries in each column,
    * `pct_nonnull` contains the proportion of entries in each column that are non-null,
    * `num_distinct` contains the number of distinct entries in each column,
    * `pct_distinct` contains the proportion of (non-null) entries in each column that are distinct from each other.
    
*Note*: you may find the `.nunique()` series method useful.

*Note*: The number of distinct entries does not include nulls.
    
2. Create a function `most_common` which takes in a dataframe `df` and a number `N` and returns a dataframe of the `N` most-common values (and their counts) for each column of `df`. Any column with fewer than `N` distinct values should contain `NaN` in those entries. 

*Note*: you can loop through the *columns* of `df` to construct your output. You should **not** be looping through rows.

For example, for the subset of the `salaries` dataframe with columns 'Job Title' and 'Status' from lecture one (left), `most_common(salaries, N=5)` is given (right). 

<table><tr>
    <td><img src="data/imgs/dataframe.png" width="70%"/></td>
    <td><img src="data/imgs/most_common.png" width="70%"/></td>
</tr></table>
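
Both functions rest on a few Series methods. The sketch below runs them on a made-up column with one missing entry (the values `'FT'` and `'PT'` are hypothetical) to show where counts and values end up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing entry.
status = pd.Series(['FT', 'PT', 'FT', np.nan, 'FT', 'PT'])

n_nonnull = status.count()       # number of non-null entries
n_distinct = status.nunique()    # nulls are ignored by default

counts = status.value_counts()       # sorted by frequency, most common first
top_values = counts.index.tolist()   # the values themselves are the index
top_counts = counts.tolist()         # the counts are the data
```

Note that the values live in the *index* of the `value_counts` result, which matters when assembling the `_values` and `_counts` columns of `most_common`.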

In [52]:
def population_stats(df):
    """
    population_stats which takes in a dataframe df
    and returns a dataframe indexed by the columns
    of df, with the following columns:
        - `num_nonnull` contains the number of non-null
          entries in each column,
        - `pct_nonnull` contains the proportion of entries
          in each column that are non-null,
        - `num_distinct` contains the number of distinct
          entries in each column,
        - `pct_distinct` contains the proportion of (non-null)
          entries in each column that are distinct from each other.

    :Example:
    >>> data = np.random.choice(range(10), size=(100, 4))
    >>> df = pd.DataFrame(data, columns='A B C D'.split())
    >>> out = population_stats(df)
    >>> out.index.tolist() == ['A', 'B', 'C', 'D']
    True
    >>> cols = ['num_nonnull', 'pct_nonnull', 'num_distinct', 'pct_distinct']
    >>> out.columns.tolist() == cols
    True
    >>> (out['num_distinct'] <= 10).all()
    True
    >>> (out['pct_nonnull'] == 1.0).all()
    True
    """
    num1 = df.count()
    pct1 = num1 / len(df.index)
    num2 = df.nunique() 
    pct2 = num2 / num1
    return pd.DataFrame({"num_nonnull":num1, "pct_nonnull":pct1, "num_distinct":num2, "pct_distinct":pct2},index=df.columns)

In [53]:
data = np.random.choice(range(10), size=(100, 4))
df = pd.DataFrame(data, columns='A B C D'.split())
out = population_stats(df)
out

Unnamed: 0,num_nonnull,pct_nonnull,num_distinct,pct_distinct
A,100,1.0,10,0.1
B,100,1.0,10,0.1
C,100,1.0,10,0.1
D,100,1.0,10,0.1


In [54]:
out.index.tolist() == ['A', 'B', 'C', 'D']

True

In [55]:
cols = ['num_nonnull', 'pct_nonnull', 'num_distinct', 'pct_distinct']
out.columns.tolist() == cols

True

In [56]:
(out['num_distinct'] <= 10).all()

True

In [57]:
(out['pct_nonnull'] == 1.0).all()

True

In [132]:
def most_common(df, N=10):
    """
    `most_common` which takes in a dataframe df and returns
    a dataframe of the N most-common values (and their counts)
    for each column of df.

    :param df: input dataframe.
    :param N: number of most common elements to return (default 10).

    :Example:
    >>> data = np.random.choice(range(10), size=(100, 2))
    >>> df = pd.DataFrame(data, columns='A B'.split())
    >>> out = most_common(df, N=3)
    >>> out.index.tolist() == [0, 1, 2]
    True
    >>> out.columns.tolist() == ['A_values', 'A_counts', 'B_values', 'B_counts']
    True
    >>> out['A_values'].isin(range(10)).all()
    True
    """
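    df_dict = {}
    for col in df.columns:
        # N most common values of this column, most frequent first
        counts = df[col].value_counts().iloc[:N]
        # the values are the index of `counts`; the counts are its data
        df_dict['{}_values'.format(col)] = pd.Series(counts.index.values)
        df_dict['{}_counts'.format(col)] = pd.Series(counts.values)
    # pad columns with fewer than N distinct values with NaN
    return pd.DataFrame(df_dict).reindex(range(N))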

In [133]:
data = np.random.choice(range(10), size=(100, 2))
df = pd.DataFrame(data, columns='A B'.split())
out = most_common(df, N=3)
out

Unnamed: 0,A_values,A_counts,B_values,B_counts
0,0,16,0,15
1,1,14,1,14
2,2,12,2,13


In [60]:
# data = np.random.choice(range(10), size=(100, 2))
# df = pd.DataFrame(data, columns='A B'.split())
# df = df.A.value_counts()[:3].reset_index()
# df

Unnamed: 0,index,A
0,3,17
1,7,17
2,5,12


In [61]:
out.index.tolist() == [0, 1, 2]

True

In [62]:
out.columns.tolist() == ['A_values', 'A_counts', 'B_values', 'B_counts']

True

In [63]:
out['A_values'].isin(range(10)).all()

True

## Faulty Scooters

**Question 5**

A new electric scooter company 'Maxwell Scooters' opened a retail shop in La Jolla recently and 300 UCSD students bought new scooters for getting around campus. After 8 students start complaining their scooters are faulty, negative on-line reviews for the scooters start to spread. In response, the scooter company adamantly claims that 99% of their scooters come off the production line working properly. You think this seems unlikely and decide to investigate.

* Select a significance level for your investigation. (Not to be turned in)
* What are reasonable choices for the *Null Hypothesis* for your investigation? Select all that apply:
    1. The scooter company produces scooters that are 99% non-faulty.
    2. The scooter company produces scooters that are less than 99% non-faulty.
    3. The scooter company produces scooters that are at least 1% faulty.
    4. The scooter company produces scooters that are ~2.6% faulty.

Return your answer in a function `null_hypoth` that takes zero arguments.

* Create a function `simulate_null` that simulates a single step of data generation under the null hypothesis. The function should return a binary array.

* Create a function `estimate_p_val` that takes in a number `N` and returns the estimated p-value of your investigation upon simulating the null hypothesis `N` times.

*Note*: Plot the Null distribution and your observed statistic to check your work.
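
One way to sanity-check a simulated p-value is to compare it with the exact binomial tail. The sketch below assumes the null "99% of scooters are non-faulty", draws the number of faulty scooters directly with `numpy`'s `Generator.binomial` instead of building a binary array, and computes the exact probability with `math.comb`. The variable names and the seed are choices made for this illustration only.

```python
import math
import numpy as np

n_scooters, p_faulty, observed_faulty = 300, 0.01, 8

# Each draw is the number of faulty scooters among 300 under the null.
rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
faulty_counts = rng.binomial(n_scooters, p_faulty, size=100_000)
simulated_p = (faulty_counts >= observed_faulty).mean()

# Exact tail probability P(X >= 8) for X ~ Binomial(300, 0.01), for comparison.
exact_p = 1 - sum(
    math.comb(n_scooters, k) * p_faulty**k * (1 - p_faulty)**(n_scooters - k)
    for k in range(observed_faulty)
)
```

A histogram of `faulty_counts` with a vertical line at `observed_faulty` gives the plot suggested in the note above.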

In [64]:
def null_hypoth():
    """
    :Example:
    >>> isinstance(null_hypoth(), list)
    True
    >>> set(null_hypoth()).issubset({1,2,3,4})
    True
    """
    return [1]

In [65]:
isinstance(null_hypoth(), list)

True

In [66]:
set(null_hypoth()).issubset({1,2,3,4})

True

In [67]:
def simulate_null():
    """
    :Example:
    >>> pd.Series(simulate_null()).isin([0,1]).all()
    True
    """
    return np.random.choice([0, 1], size=300, p=[.01, .99])

In [68]:
pd.Series(simulate_null()).isin([0,1]).all()

True

In [69]:
def estimate_p_val(N):
    """
    >>> 0 < estimate_p_val(1000) < 0.1
    True
    """
    count = 0
    for run in range(N):
        sim = simulate_null()
        if sim.tolist().count(0) >= 8:
            count += 1
    return count / N

In [70]:
estimate_p_val(1000)

0.009

In [71]:
0 < estimate_p_val(1000) < 0.1

True

# Super-Heroes

The questions below analyze a dataset of super-heroes found in the `data` directory. One of the datasets has a list of attributes on each super-hero, while the other is a *boolean* dataframe of which super-heroes have which super-powers. Note, the datasets contain information on both *good* super-heroes, as well as *bad* super-heroes (AKA villains). 

### Super-hero powers

**Question 6**

Now read in the dataset of super-hero powers in the `data` directory. Create a function `super_hero_powers` that takes in a dataframe like `powers` and returns a list with the following three entries:

1. The name of the super-hero with the greatest number of powers.
2. The name of the most common super-power among super-heroes whose names begin with 'M'.
3. The most popular super-power among those with only one super-power.

You should *not* be hard-coding your answers in this question; your function should work on any dataset similar to `powers`. You should not be using loops in this question.

*Note:* You may find the `.idxmax` method useful in this problem.
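
As an illustration of `.idxmax`, the sketch below works on a made-up three-hero table with the same layout assumed for `powers`: a `hero_names` column followed by boolean columns. The hero and power names here are hypothetical.

```python
import pandas as pd

# Hypothetical miniature version of the powers table.
toy_powers = pd.DataFrame({
    'hero_names': ['Alpha', 'Mira', 'Mox'],
    'Flight': [True, False, True],
    'Agility': [True, True, True],
    'Stealth': [True, False, False],
})

flags = toy_powers.set_index('hero_names')

# Row sums count the powers per hero; idxmax returns the label of the largest.
most_powerful = flags.sum(axis=1).idxmax()

# Column sums over a filtered set of rows give the most common power in that group.
m_heroes = flags[flags.index.str.startswith('M')]
common_m_power = m_heroes.sum().idxmax()
```

When several labels tie for the maximum, `idxmax` returns the first one, so ties need separate handling if they matter.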

In [72]:
powers_fp = os.path.join('data', 'superheroes_powers.csv')
powers = pd.read_csv(powers_fp)

In [73]:
def super_hero_powers(powers):
    """
    `super_hero_powers` takes in a dataframe like
    powers and returns a list with the following three entries:
        - The name of the super-hero with the greatest number of powers.
        - The name of the most common super-power among super-heroes whose names begin with 'M'.
        - The most popular super-power among those with only one super-power.

    :Example:
    >>> fp = os.path.join('data', 'superheroes_powers.csv')
    >>> powers = pd.read_csv(fp)
    >>> out = super_hero_powers(powers)
    >>> isinstance(out, list)
    True
    >>> len(out)
    3
    >>> all([isinstance(x, str) for x in out])
    True
    """
    
    op_hero_ind = powers[powers==True].count(axis=1).nlargest(5).index[0]
    op_hero = powers.iloc[op_hero_ind].values[0]
    
    m_df = powers.loc[powers.hero_names.str[0] == 'M']
    m_most_common = m_df[m_df==True].count().nlargest(1).index[0]
    
    # filter for those with only 1 true then run same code as heroes whose names begins with m
    #df_index_reset = powers.reset_index()
    tf_df = powers.drop('hero_names', axis=1)
    one_power = tf_df.loc[tf_df.sum(axis=1) == 1]
    common_one_power = one_power[one_power==True].count().nlargest(1).index[0]
    
    return [op_hero, m_most_common, common_one_power]

In [74]:
fp = os.path.join('data', 'superheroes_powers.csv')
powers = pd.read_csv(fp)
out = super_hero_powers(powers)
out

['Spectre', 'Super Strength', 'Intelligence']

In [75]:
# op_hero_ind = powers[powers==True].count(axis=1).nlargest(5).index[0]
# powers.iloc[op_hero_ind].values[0]

In [76]:
# m_df = powers.loc[powers.hero_names.str[0] == 'M']
# m_df[m_df==True].count().nlargest(1).index[0]

In [77]:
# tf_df = powers.drop('hero_names', axis=1)
# one_power = tf_df.loc[tf_df.sum(axis=1) == 1]
# common_one_power = one_power[one_power==True].count().nlargest(1).index[0]

In [78]:
isinstance(out, list)

True

In [79]:
len(out)

3

In [80]:
all([isinstance(x, str) for x in out])

True

### Super-hero attributes

Read in the dataset of super-hero attributes from the `data` directory. Use your summary functions from question 4 to help acquaint yourself with the dataset.

**Question 7**

Cleaning the data: the dataset has no explicit null (`np.NaN`) values, although many entries *should* be null. Replace these values with null by creating a function `clean_heroes`.

Now answer the following questions, collecting your answers in a list returned by a function `super_hero_stats`. You should answer the questions using the *cleaned* super-heroes data; your answers *should* be hard-coded in the function.
1. Which publisher has a greater proportion of 'bad' characters -- Marvel Comics or DC Comics?
2. Give the number of characters that are NOT human, or whose publisher is neither Marvel Comics nor DC Comics. For this question, only consider race "Human" as human; races such as "Human / Radiation" don't count as human.
3. Give the name of the character that is both more than one standard deviation above the mean in height and at least one standard deviation below the mean in weight.
4. Who is heavier on average: good or bad characters?
5. What is the name of the tallest Mutant with no hair?
6. What is the probability that a randomly chosen 'Marvel' character in the dataset is a woman?

*Note:* Since your answers to these questions should be hard-coded, you should not include your code in your .py file. Just return a list with your answers.

*Note:* `NaN` denotes an unknown value that does not count as an entry with any attributes.

In [81]:
superheroes_fp = os.path.join('data', 'superheroes.csv')
heroes = pd.read_csv(superheroes_fp, index_col=0)

In [82]:
heroes

Unnamed: 0,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
0,A-Bomb,Male,yellow,Human,No Hair,203.0,Marvel Comics,-,good,441.0
1,Abe Sapien,Male,blue,Icthyo Sapien,No Hair,191.0,Dark Horse Comics,blue,good,65.0
2,Abin Sur,Male,blue,Ungaran,No Hair,185.0,DC Comics,red,good,90.0
3,Abomination,Male,green,Human / Radiation,No Hair,203.0,Marvel Comics,-,bad,441.0
4,Abraxas,Male,blue,Cosmic Entity,Black,-99.0,Marvel Comics,-,bad,-99.0
...,...,...,...,...,...,...,...,...,...,...
729,Yellowjacket II,Female,blue,Human,Strawberry Blond,165.0,Marvel Comics,-,good,52.0
730,Ymir,Male,white,Frost Giant,No Hair,304.8,Marvel Comics,white,good,-99.0
731,Yoda,Male,brown,Yoda's species,White,66.0,George Lucas,green,good,17.0
732,Zatanna,Female,blue,Human,Black,170.0,DC Comics,-,good,57.0


In [83]:
# heroes['Skin color'] = heroes['Skin color'].replace('-', np.NaN)
# heroes
type(heroes['Weight'].values[4])# = np.NaN
#heroes

numpy.float64

In [84]:
heroes['Hair color'].unique()

array(['No Hair', 'Black', 'Blond', 'Brown', '-', 'White', 'Purple',
       'Orange', 'Pink', 'Red', 'Auburn', 'Strawberry Blond', 'black',
       'Blue', 'Green', 'Magenta', 'Brown / Black', 'Brown / White',
       'blond', 'Silver', 'Red / Grey', 'Grey', 'Orange / White',
       'Yellow', 'Brownn', 'Gold', 'Red / Orange', 'Indigo',
       'Red / White', 'Black / Blue'], dtype=object)

In [85]:
heroes['Eye color'].unique()

array(['yellow', 'blue', 'green', 'brown', '-', 'red', 'violet', 'white',
       'purple', 'black', 'grey', 'silver', 'yellow / red',
       'yellow (without irises)', 'gold', 'blue / white', 'hazel',
       'green / blue', 'white / red', 'indigo', 'amber', 'yellow / blue',
       'bown'], dtype=object)
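
The arrays above contain spellings that differ only by capitalization (for example `'Blond'` and `'blond'`). One way to surface such pairs, sketched here on a hand-copied subset of the hair-color values, is to group the labels by their case-folded form and look for keys with more than one spelling. Typos such as `'Brownn'` or `'bown'` are not caught by this check and still need to be found by eye.

```python
import pandas as pd

# Subset of the hair-color values listed above.
hair = pd.Series(['Blond', 'blond', 'Black', 'black', 'Brown', 'Brownn', 'No Hair'])

# Count distinct spellings that collapse to the same case-folded key.
spellings_per_key = hair.groupby(hair.str.casefold()).nunique()
case_variants = spellings_per_key[spellings_per_key > 1]

print(list(case_variants.index))  # prints: ['black', 'blond']
```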

In [86]:
def clean_heroes(heroes):
    """
    clean_heroes takes in the dataframe heroes
    and replaces values that are 'null-value'
    place-holders with np.NaN.

    :Example:
    >>> superheroes_fp = os.path.join('data', 'superheroes.csv')
    >>> heroes = pd.read_csv(superheroes_fp, index_col=0)
    >>> out = clean_heroes(heroes)
    >>> out['Skin color'].isnull().any()
    True
    >>> out['Weight'].isnull().any()
    True
    """
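    # Replace the '-' placeholder; `replace` returns a new dataframe,
    # so the caller's dataframe is left unmodified.
    heroes = heroes.replace('-', np.nan)

    # Negative heights and weights (e.g. -99.0) stand in for missing values.
    num_cols = heroes.select_dtypes(include='number').columns
    heroes[num_cols] = heroes[num_cols].mask(heroes[num_cols] < 0)

    # Normalize misspelled and inconsistently capitalized hair colors.
    hair_fixes = {'Strawberry Blond': 'Blond', 'blond': 'Blond',
                  'Brownn': 'Brown', 'black': 'Black'}
    heroes['Hair color'] = heroes['Hair color'].replace(hair_fixes)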

    return heroes

In [135]:
superheroes_fp = os.path.join('data', 'superheroes.csv')
heroes = pd.read_csv(superheroes_fp, index_col=0)
out = clean_heroes(heroes)
out 

Unnamed: 0,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
0,A-Bomb,Male,yellow,Human,No Hair,203.0,Marvel Comics,,good,441.0
1,Abe Sapien,Male,blue,Icthyo Sapien,No Hair,191.0,Dark Horse Comics,blue,good,65.0
2,Abin Sur,Male,blue,Ungaran,No Hair,185.0,DC Comics,red,good,90.0
3,Abomination,Male,green,Human / Radiation,No Hair,203.0,Marvel Comics,,bad,441.0
4,Abraxas,Male,blue,Cosmic Entity,Black,,Marvel Comics,,bad,
...,...,...,...,...,...,...,...,...,...,...
729,Yellowjacket II,Female,blue,Human,Blond,165.0,Marvel Comics,,good,52.0
730,Ymir,Male,white,Frost Giant,No Hair,304.8,Marvel Comics,white,good,
731,Yoda,Male,brown,Yoda's species,White,66.0,George Lucas,green,good,17.0
732,Zatanna,Female,blue,Human,Black,170.0,DC Comics,,good,57.0


In [88]:
out['Skin color'].isnull().any()

True

In [89]:
out['Weight'].isnull().any()

True

In [90]:
# Proportion of 'good' characters among those with a known alignment.
marvel = out.loc[out.Publisher == 'Marvel Comics']
marvel_prop = marvel.Alignment.value_counts(normalize=True)['good']
print(marvel_prop)
dc = out.loc[out.Publisher == 'DC Comics']
dc_prop = dc.Alignment.value_counts(normalize=True)['good']
print(dc_prop)

0.6727272727272727
0.6635514018691588


In [136]:
not_human = out.loc[out.Race != 'Human']
print(not_human.shape[0])

non_marvel = out.loc[out.Publisher != 'Marvel Comics']
non_marveldc = non_marvel.loc[non_marvel.Publisher != 'DC Comics']
print(non_marveldc.shape[0])

overlap = non_marveldc.loc[non_marveldc.Race != 'Human']
print(overlap.shape[0])

526
131
99


In [92]:
height_cutoff = out.Height.mean() + out.Height.std()
weight_cutoff = out.Weight.mean() - out.Weight.std()
# Strictly more than one SD above the mean height, as the question asks.
tall = out.loc[out.Height > height_cutoff]
mystery = tall.loc[tall.Weight <= weight_cutoff]
mystery.name.values[0]

'Groot'

In [93]:
goodw = out.loc[out.Alignment == 'good'].Weight.mean()
print(goodw)
badw = out.loc[out.Alignment == 'bad'].Weight.mean()
print(badw)

95.54654654654655
139.80985915492957


In [94]:
mutants = out.loc[out.Race == 'Mutant']
no_hair_mutants = mutants.loc[mutants['Hair color'] == 'No Hair']
no_hair_mutants.sort_values(by=['Height'], ascending=False).iloc[0]['name']

'Onslaught'

In [95]:
marvel = out.loc[out.Publisher == 'Marvel Comics']
marvel.loc[marvel.Gender == 'Female'].shape[0] / marvel.shape[0]

0.2860824742268041

In [96]:
def super_hero_stats():
    """
    Returns a list that answers the questions in the notebook.
    :Example:
    >>> out = super_hero_stats()
    >>> out[0] in ['Marvel Comics', 'DC Comics']
    True
    >>> isinstance(out[1], int)
    True
    >>> isinstance(out[2], str)
    True
    >>> out[3] in ['good', 'bad']
    True
    >>> isinstance(out[4], str)
    True
    >>> 0 <= out[5] <= 1
    True
    """

    return ['Marvel Comics',558,'Groot','bad','Onslaught',0.2860824742268041]

In [97]:
out = super_hero_stats()
out

['Marvel Comics', 558, 'Groot', 'bad', 'Onslaught', 0.2860824742268041]

In [98]:
out[0] in ['Marvel Comics', 'DC Comics']

True

In [99]:
isinstance(out[1], int)

True

In [100]:
isinstance(out[2], str)

True

In [101]:
out[3] in ['good', 'bad']

True

In [102]:
isinstance(out[4], str)

True

In [103]:
0 <= out[5] <= 1

True

### Are blond-haired, blue-eyed characters disproportionately 'good'?

**Question 8**

1. Create a function `bhbe_col` ('blond-hair-blue-eyes') that returns a boolean column that labels super-heroes/villains that are blond-haired *and* blue-eyed.
    * Look at the values of the hair/eyes columns; they need some cleaning! (The doctest makes sure you've cleaned them properly. If you don't pass the doctest, look more closely at the values in the columns!)


Now, you'd like to answer the question 
> "Are blond-haired, blue-eyed characters disproportionately 'good'?"

To do this, you'd like to test the null hypothesis:
> "The proportion of 'good' heroes among blond-haired, blue-eyed heroes is roughly the same as (equals) the proportion of 'good' heroes in the overall population."

Fix a significance level of 1%.

2. Create a function `observed_stat` that takes in `heroes`, and returns the observed test statistic.
3. Create a function `simulate_bhbe_null` that takes in a number `n` and returns `n` instances of the test statistic generated under the null hypothesis. You should hard-code your simulation parameter into the function (rounding to the nearest hundredth is fine); the function should *not* read in any data.
4. Create a function `calc_pval` that returns a list where:
    * the first element is the p-value for the hypothesis test (using 100,000 simulations). Please run the code yourself and hard-code this answer, as actually running the 100,000-simulation hypothesis test will time out on Gradescope.
    * the second element is `Reject` if you reject the null hypothesis and `Fail to reject` if you fail to reject the null hypothesis.
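
The steps above follow the usual shape of a one-sided simulation test: draw many samples under the null, compute the statistic for each, and report the share of simulated statistics at least as extreme as the observed one. The sketch below illustrates that shape with made-up numbers; the helper names (`simulate_null_proportions`, `empirical_pvalue`) and the values 50, 0.6 and 0.8 are assumptions for illustration, not the lab's required functions or answers.

```python
import numpy as np

def simulate_null_proportions(n_sims, group_size, p_null, seed=None):
    """Hypothetical helper: proportion of 'good' in each of n_sims null samples."""
    rng = np.random.default_rng(seed)
    # Each draw is the number of 'good' characters in a group of group_size.
    return rng.binomial(group_size, p_null, size=n_sims) / group_size

def empirical_pvalue(simulated, observed):
    """Share of simulated statistics at least as large as the observed one."""
    return float(np.mean(simulated >= observed))

# Illustrative values only (not taken from the dataset).
sims = simulate_null_proportions(10_000, group_size=50, p_null=0.6, seed=0)
p_val = empirical_pvalue(sims, observed=0.8)

# Compare against the 1% significance level fixed above.
decision = 'Reject' if p_val < 0.01 else 'Fail to reject'
print(p_val, decision)
```

Using a binomial draw per sample avoids materializing an `n_sims` by `group_size` array, which matters once the number of simulations reaches 100,000.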

In [104]:
def bhbe_col(heroes):
    """
    `bhbe` ('blond-hair-blue-eyes') returns a boolean
    column that labels super-heroes/villains that
    are blond-haired *and* blue eyed.

    :Example:
    >>> superheroes_fp = os.path.join('data', 'superheroes.csv')
    >>> heroes = pd.read_csv(superheroes_fp, index_col=0)
    >>> out = bhbe_col(heroes)
    >>> isinstance(out, pd.Series)
    True
    >>> out.dtype == np.dtype('bool')
    True
    >>> out.sum()
    93
    """
    #     bh = cleaned[cleaned['Hair color'] == 'Blond']
    #     bhbe = bh[bh['Eye color'] == 'blue'].index

    cleaned = clean_heroes(heroes)
    cleaned['newcol'] = (cleaned['Hair color'] == 'Blond') & (cleaned['Eye color'] == 'blue')
    
    return cleaned['newcol']

In [105]:
superheroes_fp = os.path.join('data', 'superheroes.csv')
heroes = pd.read_csv(superheroes_fp, index_col=0)
out = bhbe_col(heroes)
out

0      False
1      False
2      False
3      False
4      False
       ...  
729     True
730    False
731    False
732    False
733    False
Name: newcol, Length: 734, dtype: bool

In [106]:
isinstance(out, pd.Series)

True

In [107]:
out.dtype == np.dtype('bool')

True

In [108]:
out.sum()

93

In [109]:
cleaned = clean_heroes(heroes)
cleaned['newcol'] = (cleaned['Hair color'] == 'Blond') & (cleaned['Eye color'] == 'blue')
    
bhbe = cleaned.loc[cleaned.newcol == True]
bhbe_size = bhbe.loc[bhbe.Alignment == 'good'].shape[0]# / bhbe.shape[0]
bhbe_size

79

In [110]:
def observed_stat(heroes):
    """
    observed_stat returns the observed test statistic
    for the hypothesis test.

    :Example:
    >>> superheroes_fp = os.path.join('data', 'superheroes.csv')
    >>> heroes = pd.read_csv(superheroes_fp, index_col=0)
    >>> out = observed_stat(heroes)
    >>> 0.5 <= out <= 1.0
    True
    """
    cleaned = clean_heroes(heroes)
    cleaned['newcol'] = (cleaned['Hair color'] == 'Blond') & (cleaned['Eye color'] == 'blue')
    
    bhbe = cleaned.loc[cleaned.newcol == True]
    bhbe_prop = bhbe.loc[bhbe.Alignment == 'good'].shape[0] / bhbe.shape[0]
    
    return bhbe_prop

In [111]:
superheroes_fp = os.path.join('data', 'superheroes.csv')
heroes = pd.read_csv(superheroes_fp, index_col=0)
out = observed_stat(heroes)
out

0.8494623655913979

In [112]:
0.5 <= out <= 1.0

True

In [126]:
def simulate_bhbe_null(n):
    """
    `simulate_bhbe_null` that takes in a number `n`
    that returns a `n` instances of the test statistic
    generated under the null hypothesis.
    You should hard code your simulation parameter
    into the function; the function should *not* read in any data.

    :Example:
    >>> superheroes_fp = os.path.join('data', 'superheroes.csv')
    >>> heroes = pd.read_csv(superheroes_fp, index_col=0)
    >>> out = simulate_bhbe_null(10)
    >>> isinstance(out, pd.Series)
    True
    >>> out.shape[0]
    10
    >>> ((0.45 <= out) & (out <= 1)).all()
    True
    """
    simulations = pd.DataFrame(np.random.choice([1,0],p=[.68,.32], size=(n,734)))
    test_stats = simulations.sum(axis=1) / 734
    
    return test_stats

In [127]:
superheroes_fp = os.path.join('data', 'superheroes.csv')
heroes = pd.read_csv(superheroes_fp, index_col=0)
out = simulate_bhbe_null(10)
out

0    0.677112
1    0.670300
2    0.668937
3    0.697548
4    0.698910
5    0.659401
6    0.705722
7    0.698910
8    0.670300
9    0.682561
dtype: float64

In [128]:
isinstance(out, pd.Series)

True

In [129]:
out.shape[0]

10

In [130]:
((0.45 <= out) & (out <= 1)).all()

True

In [131]:
obs = observed_stat(heroes)
simulations = simulate_bhbe_null(100000)
(simulations >= obs).mean()

0.0
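
The simulation above draws 734 values per sample, while the observed statistic is a proportion among the 93 characters flagged by `bhbe_col`. If the null samples are meant to mirror the blond-haired, blue-eyed group (an assumption about the intended design, not something this notebook states), the sample size changes how spread out the null distribution is. The sketch below compares the theoretical standard deviation of a sample proportion for both sizes, using the hard-coded 0.68 and the observed statistic printed earlier.

```python
import math

p_null = 0.68                    # proportion hard-coded in simulate_bhbe_null
observed = 0.8494623655913979    # value returned by observed_stat above

def null_sd(sample_size, p=p_null):
    """Standard deviation of a sample proportion under the null."""
    return math.sqrt(p * (1 - p) / sample_size)

# Distance of the observed statistic from the null, in standard deviations.
z_by_size = {size: (observed - p_null) / null_sd(size) for size in (734, 93)}

for size, z in z_by_size.items():
    print(size, round(null_sd(size), 4), round(z, 2))
```

Under either sample size the observed proportion sits several standard deviations above 0.68, so the decision at the 1% level is unlikely to change. With samples of 93, however, an empirical p-value from 100,000 simulations may come out small but nonzero rather than exactly 0.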

In [None]:
def calc_pval():
    """
    calc_pval returns a list where:
        - the first element is the p-value for
        hypothesis test (using 100,000 simulations).
        - the second element is Reject if you reject
        the null hypothesis and Fail to reject if you
        fail to reject the null hypothesis.

    :Example:
    >>> out = calc_pval()
    >>> len(out)
    2
    >>> 0 <= out[0] <= 1
    True
    >>> out[1] in ['Reject', 'Fail to reject']
    True
    """
    
    #bhbe = bhbe_col(heroes).astype(int)
    #(simulations >= bhbe).mean()
    return [0,'Reject']

In [None]:
out = calc_pval()

In [None]:
len(out)

In [None]:
0 <= out[0] <= 1

In [None]:
out[1] in ['Reject', 'Fail to reject']