# 10 Basic statistics

In this notebook we will cover descriptive stats with `numpy`, t-tests and probability distributions with `scipy`, and multiple regression with `statsmodels`. There are many ways to do these things, so these packages were chosen to show you a bit of the breadth of possibilities. For example, I'd probably use `scikit-learn` for regressions, or `pystan` if I wanted a Bayesian regression. In this notebook we'll also continue with the mini project.

## 10.1 Descriptive statistics using `numpy`

In [5]:
import numpy as np

As you know, Python has only a small number of built-in functions, and `mean` and `std` are not among them. Luckily, `numpy` provides both, either as functions (`np.mean`) or as array methods (`array.mean`). By default, they operate over the whole array:

In [17]:
A = np.random.randn(10, 5)
print('mean:', A.mean())
print('std:', A.std())

mean: -0.07172073438985951
std: 0.955791447389602


If you want to apply these methods along a particular dimension (e.g. to get the column means), you have to specify the `axis`:
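
As a minimal sketch of how `axis` works (the small array `B` is made up for illustration), `axis` names the dimension that gets collapsed. With `axis=1` below, the values are averaged across the columns, giving one value per row. The function form `np.std` accepts the same argument:

```python
import numpy as np

# Hypothetical 2 x 3 array, used only for this illustration
B = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

row_means = B.mean(axis=1)    # collapse dimension 1 -> one mean per row
row_stds = np.std(B, axis=1)  # function form, same axis argument

print(row_means)  # [2. 5.]
print(row_stds)
```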

**Exercise**

Get the mean and standard deviations for the columns of A:

In [19]:
# your code here

**Project Exercise 3**

Next step in the project: Descriptive statistics. Write a function that computes and prints appropriate statistics for the variables in the data set, i.e. mean and standard deviation for `Grip` and `Anxiety` and percent male/female for `Sex`. Use the full dataset, i.e. not split by `Sex`. As usual, feel free to play around in a notebook and transfer the function to the project file when you're done.

## 10.2 Mean comparison using `scipy.stats`

As seen above, `numpy` provides some basic descriptive stats functionality. Inferential stats, however, are beyond the realm of `numpy`. That's not a problem, because now we'll get to know the next important package of the scientific Python stack: `scipy`. More specifically, we're going to use its `stats` module.

In [21]:
from scipy import stats

This module contains all kinds of probability distributions, as well as functions for standard statistical methods: t-tests, ANOVA, univariate regression. The function for an independent-samples t-test is `stats.ttest_ind`. It expects at least two arguments, which are the data points from the two samples to compare:

```python
result = stats.ttest_ind(a, b)
```

In addition, optional arguments let you, for example, use a permutation test instead of the canonical t-distribution to get the p-values.
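
To make the link to the t-distribution concrete, here is a small sketch on made-up samples `x` and `y` (not the exercise data). It reproduces the default two-sided p-value by hand from `stats.t` with `n1 + n2 - 2` degrees of freedom. It also shows one of the optional arguments, `equal_var=False`, which switches to Welch's t-test for samples with unequal variances:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=30)  # hypothetical sample 1
y = rng.normal(loc=0.5, scale=2.0, size=40)  # hypothetical sample 2

# Default: pooled variance, two-sided test
student = stats.ttest_ind(x, y)

# Same p-value by hand from the t-distribution
df = len(x) + len(y) - 2
p_by_hand = 2 * stats.t.sf(abs(student.statistic), df)

# Welch's t-test: does not assume equal variances
welch = stats.ttest_ind(x, y, equal_var=False)

print(student.pvalue, p_by_hand)  # the two values agree
print(welch.statistic, welch.pvalue)
```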

In [23]:
a = np.random.randn(100) + 1
b = np.random.randn(100) + 4

**Exercise**

Check how likely it is to observe a mean difference at least as large as the one between `a` and `b`, assuming that both are drawn from the same population (i.e. compute an independent-samples t-test):

In [28]:
# your code here


**Project Exercise 4**

Write a function that compares `Grip` and `Anxiety` between men and women.

## 10.3 Multiple regression using `statsmodels`

There are many ways to run a multiple regression. One way is to use the `statsmodels` package, which you should have installed. It uses an object-oriented approach, i.e. we first create a linear regression object with the data to fit, and then fit it in a second step:

```python
import statsmodels.api as sm
lin_reg = sm.OLS(y, X) # dependent variable comes first!
fit = lin_reg.fit()
fit.summary()
```

If your design matrix doesn't have a column of ones yet (i.e. the column for the intercept), you can add one like so:

```python
X = sm.add_constant(X)
```

We'll simulate some data like we did before, except that we don't include the intercept in the design matrix:

In [56]:
X = np.random.randn(100, 2)
epsilon = np.random.randn(100) * .3
betas = np.array([5, 2.4])
intercept = 2.5
y = X.dot(betas) + intercept + epsilon

**Exercise**

Run a multiple regression to recover the betas. Don't forget to add the column of ones to the design matrix!

In [60]:
import statsmodels.api as sm
# your code here


**Project Exercise**

This is a bigger one. We might postpone it to make sure that there is a bit of time for matplotlib.

The point of the paper is that there's a relationship between grip strength and anxiety. But if we include sex as a predictor, this relationship should go away, i.e. it is explained by `Sex` (men are higher on grip strength and lower on anxiety).

Compute two regressions. First predict anxiety from grip strength in the full sample. Then compute it again but include sex as another predictor.