## Outline

- Permutation testing - Applying our 'simulation method' to differences between means
- Data concepts: Permutation testing
    - Mosquitoes and beer - Roughly, the question is:  Does the mean number of mosquitoes attracted differ between beer drinkers and water drinkers?
    - Tests of mean differences are common (t-tests in traditional statistics)
    - The less simplified version - includes issues of data wrangling and research design
    - If you look at <a href="mosquitos_and_beer_article.pdf">the original paper</a> you'll see that the stats are different than what you get in the example from the textbook (or video)
    - That is because the textbook/video example uses a simplified measure (for good reason)
    - We will use a real measure and that is where data wrangling comes in
    - Our value will be closer to the paper's value
- Python: merging dataframes -  this is an important method for combining data from different sources

## Permutation test: Applying our simulation method to differences between means
## -Mosquitoes and beer-
## Are you more attractive to malaria mosquitoes after a beer? 
#### (Remember this? - see video from the beginning of class)

Mosquitoes and beer experiment

Design:
- Put one person in a tent to capture odor from breath and body
- Open the other tent to the outside air
- Draw air from these two sources into boxes at the end of a Y-junction
- Release mosquitoes
- Do they go left to the box connected to the person?
- Do they go right to the box connected to outside air?

# Experimental setup

<img src="AnnotatedExpSetup.png"/>

### Read in the data

In [1]:
# The data are in the file: mosquito_beer.csv - read the data into a dataframe

In [None]:
# get information about the columns of the dataframe

In [None]:
# look at some information at the top of the file
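
One way the three cells above could be filled in is sketched below. So that it runs without the file, it reads a few made-up rows from a string; the row values, and the 'beer'/'water' and 'before'/'after' labels, are assumptions for illustration only. With the real file you would pass 'mosquito_beer.csv' to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Made-up rows standing in for mosquito_beer.csv (illustrative values only,
# and only a subset of the real columns)
csv_text = """volunteer,group,test,nb_released,no_odour,volunt_odour,activated
v1,beer,before,50,3,20,23
v1,beer,after,50,2,30,32
v2,water,before,50,5,18,23
v2,water,after,50,4,19,23
"""

# With the real file this would be: pd.read_csv('mosquito_beer.csv')
mosq = pd.read_csv(io.StringIO(csv_text))

# Column names, non-null counts and dtypes
mosq.info()

# First few rows of the dataframe
print(mosq.head())
```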

#### Information about the dataframe:
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 86 entries, 0 to 85
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   volunteer     86 non-null     object 
 1   group         86 non-null     object 
 2   test          86 non-null     object 
 3   nb_released   86 non-null     int64  
 4   no_odour      86 non-null     int64  
 5   volunt_odour  86 non-null     int64  
 6   activated     86 non-null     int64  
 7   co2no         81 non-null     float64
 8   co2od         80 non-null     float64
 9   temp          82 non-null     float64
 10  trapside      86 non-null     object 
 11  datetime      86 non-null     object 
dtypes: float64(3), int64(4), object(5)
memory usage: 8.2+ KB
```

#### Top of the dataframe:
<img src=mosq_beer_df_header.png>

#### Can you see how to link the design to the data?

- What is in the 'group' column?
- What is in the 'test' column?  (why do they have this?)
- What is in 'nb_released'?
- What is 'no_odour'?
- What is 'volunt_odour'?
- What is 'activated'?

### A clarification on the design

<img src=beer_water_exp_design.png align="center">

Note that the full experiment is a 'mixed design' -- do you see why?  

Pay attention to:
- before/after 
- water/beer 

#### We need a single value that measures the beer/water difference
- What will you use?
- Think about what data you have
  - Total mosquitoes released
  - Activated mosquitoes
  - 'No odor' mosquitoes
  - 'Volunteer odor' mosquitoes
  - before/after -- What is this about?  How will you use this information?
- Conceptually (first), what might tell you if mosquitoes liked beer volunteers more than water volunteers?

### Get your ideas about how to do this conceptually clear first
Then work on the python implementation

### Python implementation

### Hint:  This is going to involve splitting the data frame, putting it back together and taking some differences

- Get the steps you want to do clear (not in actual python code, just understand the steps)
- Translate into python code
- Remember McDonald's and Starbucks?
- We split the dataframe into the different restaurants
- We will do something similar here
- Then we will put the parts back together differently so we can subtract after - before (hint, hint)
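
The splitting step might look something like this minimal sketch. It uses a small made-up dataframe, and it assumes the 'group' column holds 'beer'/'water' and the 'test' column holds 'before'/'after'; check the real values in your own dataframe before copying the conditions.

```python
import pandas as pd

# Made-up data with the same column names as the real dataframe
toy = pd.DataFrame({
    'volunteer': ['v1', 'v1', 'v2', 'v2'],
    'group': ['beer', 'beer', 'water', 'water'],
    'test': ['before', 'after', 'before', 'after'],
    'activated': [23, 32, 23, 23],
})

# Boolean masks: one True/False value per row
is_water = toy['group'] == 'water'
is_before = toy['test'] == 'before'
is_after = toy['test'] == 'after'

# Combine masks with & to pick out one cell of the design
before_water_data = toy[is_water & is_before]
after_water_data = toy[is_water & is_after]

print(before_water_data)
print(after_water_data)
```

The same pattern with `toy['group'] == 'beer'` would give the two beer pieces.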

#### Parts of this are hard.  If they aren't obvious at first, that is normal

### What are your steps?

Write your steps here

## Python moment:  How to put two dataframes together based on shared information

- Both dataframes must have at least one column that has the same information
  - e.g. both have information about participants
- the rest of the information is different in each dataframe
- we want one new dataframe that puts **all** information about each participant on a single row
  - e.g. one dataframe has ppt heights
  - another dataframe has ppt weights
  - We want one dataframe with ppt, height, weight
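
The height/weight example can be sketched with `pd.merge`, which the next section describes in detail. The participant labels and numbers here are made up for illustration.

```python
import pandas as pd

# Two made-up dataframes that share a 'ppt' column
heights = pd.DataFrame({'ppt': ['p1', 'p2', 'p3'],
                        'height': [170, 165, 180]})
weights = pd.DataFrame({'ppt': ['p3', 'p1', 'p2'],
                        'weight': [80, 68, 59]})

# Align rows on 'ppt' so each participant ends up on a single row,
# even though the two dataframes list participants in a different order
combined = pd.merge(heights, weights, on='ppt')

print(combined)
```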


# MERGE

This is a basic data wrangling operation - it puts two dataframes together, aligning them on one or more columns
- We want to align based on the 'volunteer' column so data from the same person stays together
<br>

### Merge arguments and what they mean:

The merge statement:
```python
water_data = pd.merge(before_water_data,
                      after_water_data,
                      how='outer',
                      on=['volunteer'],
                      suffixes=('_before','_after'))
```

Arguments:
- The first two arguments are the names of the dataframes to join: 
    - before_water_data 
    - after_water_data

<img src="join_types.png" align="right" width=300>

- 'how' specifies the way the dataframes are joined.  There are different types:
    - 'outer' - keep rows from both dataframes even if a row has no match in the other dataframe (the missing values become NaN)
        - e.g. if 'after' had a participant missing we would still keep the 'before' entry
    - 'inner' - keep rows that have entries in both dataframes
        - anything missing is removed
    - 'left' - keep the rows from 'before_water_data'
        - anything from 'after' that is not in 'before' is removed
    - 'right' - keep the rows from 'after_water_data'
        - anything from 'before' that is not in 'after' is removed  
<br>  


- 'on' says which column to use to align the dataframes
    - e.g. we want to keep the values from the same person aligned  
<br>  

- 'suffixes' adds a suffix to column names so we know which dataframe each column came from (here '_before' and '_after')  
  - This applies when both dataframes have some columns with the same name
    - e.g. if both dataframes have an 'activated' column, the joined dataframe will have
      - 'activated_before'
      - 'activated_after'
    - there can only be two entries here, one for each dataframe
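
To see what 'how' and 'suffixes' do, here is a small sketch with made-up data in which volunteer 'v3' has a 'before' row but no 'after' row. The values are invented; only the argument names come from the merge statement above.

```python
import pandas as pd

# Made-up data: 'v3' is missing from the 'after' dataframe
before_water_data = pd.DataFrame({'volunteer': ['v1', 'v2', 'v3'],
                                  'activated': [20, 25, 18]})
after_water_data = pd.DataFrame({'volunteer': ['v1', 'v2'],
                                 'activated': [22, 24]})

# 'outer' keeps v3; its 'activated_after' value is NaN
outer = pd.merge(before_water_data, after_water_data,
                 how='outer', on=['volunteer'],
                 suffixes=('_before', '_after'))

# 'inner' drops v3 because it is not in both dataframes
inner = pd.merge(before_water_data, after_water_data,
                 how='inner', on=['volunteer'],
                 suffixes=('_before', '_after'))

print(outer)
print(inner)
```

Both results have the columns 'volunteer', 'activated_before' and 'activated_after'; they differ only in whether the unmatched row survives.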

### At the end of this part, we have one dataframe with data from 'beer' participants and 'water' participants, and each participant has a value that tells us how much the mosquitoes liked them.
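
A rough sketch of how the pieces might come together is below. It is not the answer to the exercise: the data are made up, `beer_data` is assumed to be built the same way as `water_data`, and the 'change' column (after minus before on the raw 'activated' count) is just a hypothetical stand-in for whatever measure you decide on.

```python
import pandas as pd

# Made-up merged dataframes, one per group (as produced by pd.merge)
water_data = pd.DataFrame({'volunteer': ['w1', 'w2'],
                           'activated_before': [20, 25],
                           'activated_after': [22, 24]})
beer_data = pd.DataFrame({'volunteer': ['b1', 'b2'],
                          'activated_before': [20, 22],
                          'activated_after': [30, 27]})

# Label each piece so we can still tell the groups apart after stacking
water_data['group'] = 'water'
beer_data['group'] = 'beer'

# Stack the two dataframes on top of each other
all_data = pd.concat([beer_data, water_data], ignore_index=True)

# Hypothetical measure: after minus before for each volunteer
all_data['change'] = all_data['activated_after'] - all_data['activated_before']

print(all_data)
```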

# Testing for a difference

### Now we need to test if beer/water made a difference

- Now we apply our simulation logic:
  - Decide how many beer/water experiments to simulate (a low number first, then increase when your code is checked)
  - Create an array with space to store the results
  - Make a loop
    - simulate one experiment
    - Here we do this by mixing up one column
      - (hint: 'permutation' means mixing up the outcomes -- What will you mix?)
    - Hint: You mix up the conditions you are testing (e.g. beer/water) -- make sure you understand why
      - If they don't matter, then mixing will not change the result much
      - If they **do** matter, then the simulated world will look very different from the real world
    - We are simulating a world where beer and water are the same
    - Calculate the difference between the beer and water conditions
    - Store that result
    - repeat
- Compare the beer/water difference from the simulated world with the real world outcome
  - Plot a distribution of the simulated beer/water differences
  - Plot the real world difference
  - Calculate a probability -- How likely was the real world difference in the simulated world?

###  What are your steps?

### Translate your conceptual steps into Python steps

#### The first time this process will be difficult.  It will probably take us two sessions to get through it.  Looking backwards it will seem easier...