# Project Exercises

During the workshop we will interrupt the input and use some of our new-found knowledge in a small project. The data for this project is taken from a published study, which is described in the section "The data" below.

## The general idea behind the exercises

As a general comment on the following exercises: I strongly believe in figuring stuff out by yourself as a learning tool, which means good exercises should not purely repeat stuff that we already did 1:1. This is especially true in programming, where figuring out how stuff works using the documentation, Google or Stack Overflow is part of the deal.

As a consequence, I tried to design the project assignments so that they are largely, but not completely, solvable using the stuff we covered so far. That means if you read the exercise and feel like "I don't know how to do that", don't worry: that's by design! Feel free to use the internet!

However, I am aware of the heterogeneity in tech-savviness and preexisting programming knowledge. So for the assignments during the online workshop, we will now split the group into two parts:

 1. Those who feel comfortable taking on the challenge on their own go to the breakout room.
 2. Those who prefer a more guided approach or are uncertain whether they can manage the assignment on their own can stay here. We will work out a solution together.
 
Both options are absolutely fine, so don't feel pressured! You can also change your mind midway and switch rooms.

There are a total of six exercises that should familiarize you with some basics of the scientific Python stack:

 1. Read the data
 2. Split the data set
 3. Descriptive statistics in numpy
 4. Mean comparison in scipy
 5. Linear regression in statsmodels
 6. Plot the results in matplotlib
 
You can complete the assignments in this notebook or move to a `.py` file for that. It's up to you.
 
The schedule for the workshop is **packed**. So if we manage to actually complete all these exercises, that's absolutely amazing. If not, that's no problem. I'll provide an example solution and you can try it yourself later on. You've still learnt a lot of Python in just 2.5 days.

## The data

The data for this project is taken from the following study. 

[Kerry, N., & Murray, D. R. (2021). Physical Strength Partly Explains Sex Differences in Trait Anxiety in Young Americans. Psychological Science, 32(5), 809–815. https://doi.org/10.1177/0956797620971298](https://journals.sagepub.com/doi/full/10.1177/0956797620971298)

You can find the files in the `data` folder. Originally it was a `.sav` file, i.e. an SPSS file. I took the liberty of selecting a few columns and producing a `.tsv` file that we can read into Python more easily. The topic of the study is not really important. I chose this study because it provides a nice, simple data set that we can use to practice some of the scientific Python that we cover in the notebooks. It was surprisingly hard to find a data set that was a) not too complicated and b) shared in a way that is actually usable. (Having looked at quite a few OSF repos until I found this one, I think *Psychological Science* should be pickier about handing out badges for open data and open code. If your spaghetti code is not documented and your data is pretty much unusable because there is no README that tells users which file contains which data, you might as well not share code and data. But that's beside the point.)

Still, to give you a rough idea about the study: The variables that we will look at are

 * `Grip`: A measure of grip strength
 * `Anxiety`: A questionnaire-based estimate of trait anxiety
 * `Sex`: Biological sex, coded as 0=male and 1=female
 
This kind of tabular data could be very well handled with `pandas`. In practice I would probably use that to be honest, but since we stick to the design principles and you should know `numpy`, we'll use that instead.

## 1: Read the data

The first step is of course to load the data into Python. Write a function that reads the `.tsv` file in the `data` folder. The data consists of multiple variables, and the variable names are in the first row. Your function should follow this structure:

```python
def read_data(filename):
    ...
    return data
```

To be more specific, I want the data to be a dictionary with one entry per variable. The keys should be the variable names and each value should be a numpy array.

You can use any of the methods outlined above. `pandas` is probably the easiest. If you go for the vanilla python one, you will need `str.split` and `str.strip` methods of strings and either the `int` or `float` class. If you're using `numpy`, you will have to read the column names separately.
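
If you'd like a starting point for the vanilla Python route, here is a minimal sketch. The helper name `read_table` and the toy values are made up for illustration, and the sketch assumes that every column is numeric and that there are no missing entries, which may not hold for the real file.

```python
import os
import tempfile

import numpy as np


def read_table(filename, sep="\t"):
    """Read a delimited text file with a header row into a dict of numpy arrays.

    Hypothetical helper: assumes all columns are numeric, no missing values.
    """
    with open(filename) as f:
        # The first row holds the variable names
        names = f.readline().strip().split(sep)
        columns = {name: [] for name in names}
        for line in f:
            line = line.strip()
            if not line:  # skip empty lines, e.g. at the end of the file
                continue
            for name, value in zip(names, line.split(sep)):
                columns[name].append(float(value))
    return {name: np.array(values) for name, values in columns.items()}


# Made-up toy values, written to a temporary file just for this demo
toy_text = "Grip\tAnxiety\tSex\n40.0\t2.1\t0\n25.5\t3.0\t1\n"
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as tmp:
    tmp.write(toy_text)

toy = read_table(tmp.name)
os.remove(tmp.name)  # clean up the temporary file
print(toy["Grip"])
```

Note that `float` is used for every column here, so `Sex` ends up as 0.0 and 1.0. Whether that is acceptable for your analysis is a design choice.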

Have fun!

In [3]:
# your code here. feel free to add cells as needed while you're figuring out how to do that.


Bonus if you're done early: Read the `.sav` file, extract the relevant columns and rewrite it as a `.tsv` file.

In [5]:
# your code here


## 2: Split the data set

So far the data we have combines men and women. We will compare means between them later, so we need to split up the data. You could do that ad hoc, but it's a good exercise to practice loops and boolean indices. The `Sex` column is coded as 0=male, 1=female. 

Write a function that splits the data set into two subsets, one for men and one for women.

This is what it could look like:
```python
def split_data(full_data):
    ...
    return male_data, female_data
```
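
As a reminder of how boolean indices work, here is a small sketch on made-up toy values (not the study data). It assumes the data is stored as a dictionary of equally long numpy arrays, as requested in exercise 1. The names `toy`, `toy_male` and `toy_female` are purely illustrative.

```python
import numpy as np

# Made-up toy values, not the study data
toy = {
    "Grip": np.array([40.0, 25.5, 38.0, 27.0]),
    "Sex": np.array([0, 1, 0, 1]),  # 0=male, 1=female
}

# Comparing an array to a value gives a boolean array, one entry per row
is_female = toy["Sex"] == 1

# Apply the same mask to every variable; ~ inverts the mask
toy_female = {name: values[is_female] for name, values in toy.items()}
toy_male = {name: values[~is_female] for name, values in toy.items()}

print(toy_male["Grip"])
print(toy_female["Grip"])
```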

In [4]:
# your code here


## 3: Descriptive statistics in numpy

Next step in the project before we go into the more interesting analysis: Descriptive statistics. 

Write a function that computes and prints appropriate statistics for the variables in the data set, i.e. mean and standard deviation for `Grip` and `Anxiety` and percent male/female for `Sex` (hint: the mean times 100 gives % female). Use the full dataset, i.e. not split by `Sex`. Your function can either return nothing and just print the results to the screen, or it can return an appropriate data type that contains the statistics, like one or multiple dictionaries, e.g.:

```python
def descriptive_stats(data):
    ...
    return {
        'Grip': {
            'mean': ...,
            'std': ...,
        },
        'Sex': {
            '%female': ...,
            '%male': ...
        },
        ...
    }
```
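
The individual numpy calls could look roughly like this. This is a sketch on made-up toy arrays, and the choice of `ddof=1` (sample standard deviation) is an assumption. numpy's default is `ddof=0`, so pick whichever fits your purpose.

```python
import numpy as np

# Made-up toy values, not the study data
grip = np.array([40.0, 25.5, 38.0, 27.0])
sex = np.array([0, 1, 1, 1])  # 0=male, 1=female

grip_mean = np.mean(grip)
grip_std = np.std(grip, ddof=1)  # ddof=1: sample SD; numpy's default is ddof=0

# Because Sex is coded 0/1, its mean is the proportion of women
pct_female = np.mean(sex) * 100
pct_male = 100 - pct_female

print(f"Grip: M = {grip_mean:.2f}, SD = {grip_std:.2f}")
print(f"Sex: {pct_female:.1f}% female, {pct_male:.1f}% male")
```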

In [6]:
# your code here


## 4: Mean comparison in scipy

Use t-tests to compare `Anxiety` and `Grip` between men and women. You don't necessarily have to write a function for that, but you certainly can!
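
For orientation, an independent-samples t-test in scipy could be sketched like this. The two groups are simulated toy data, and `equal_var=False` (Welch's t-test) is just one possible choice, not necessarily what the study authors used.

```python
import numpy as np
from scipy import stats

# Simulated toy groups, not the study data
rng = np.random.default_rng(0)
group_a = rng.normal(loc=40.0, scale=5.0, size=30)
group_b = rng.normal(loc=28.0, scale=5.0, size=30)

# equal_var=False requests Welch's t-test (no equal-variance assumption)
result = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")
```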

In [4]:
# your code here


## 5: Linear regression in statsmodels

See if you can predict `Anxiety` from `Grip` using linear regression. Use the full sample for that.

In [4]:
# your code here


Bonus: repeat the analysis, but split by `Sex`.

In [4]:
# your code here


## 6: Plot the results in matplotlib

Create a figure with three subplots:
 1. One with a histogram of `Grip`
 2. One with a histogram of `Anxiety`
 3. One with a scatterplot of the two variables.
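
The general layout of such a figure could be sketched as follows, again with simulated toy data and placeholder variable names `a` and `b`. The figure size and the number of bins are arbitrary choices.

```python
import matplotlib.pyplot as plt
import numpy as np

# Simulated toy data, not the study data
rng = np.random.default_rng(2)
a = rng.normal(size=100)
b = a + rng.normal(size=100)

# One row with three subplots; axes is an array of three Axes objects
fig, axes = plt.subplots(1, 3, figsize=(12, 4))

axes[0].hist(a, bins=15)
axes[0].set_xlabel("a")

axes[1].hist(b, bins=15)
axes[1].set_xlabel("b")

axes[2].scatter(a, b, s=10)
axes[2].set_xlabel("a")
axes[2].set_ylabel("b")

fig.tight_layout()  # avoid overlapping labels
```

In a notebook the figure is displayed automatically. In a `.py` file you would need `plt.show()` or `fig.savefig(...)` at the end.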

In [4]:
# your code here


Bonus: Repeat the plots, but this time use different colors for men and women. Compare the height-histograms in the 11th notebook!

In [4]:
# your code here
