# 10 Basic statistics

In this notebook we will cover descriptive stats with `numpy`, t-tests using `scipy` and multiple regression using `statsmodels`. There are a ton of ways to do these things, so these packages were chosen to show you a bit of the breadth of possibilities. E.g. I'd probably use `scikit-learn` for regressions, or `pystan` if I wanted a Bayesian regression. In this notebook we'll also continue with the mini project.

## 10.1 Descriptive statistics using `numpy`

In [None]:
import numpy as np

As you know, Python has only a small number of built-in functions, and `mean` and `std` are not among them. Luckily, numpy provides both, either as functions (`np.mean`) or as array methods (`array.mean`). By default, they operate on the whole array:

In [None]:
A = np.random.randn(10, 5)
print(f"{A.mean() = }")
print(f"{A.std() = }")

If you want to apply these methods along a particular dimension (e.g. to get the column means), you have to specify the `axis`, i.e. the dimension along which you want to perform the operation.
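
As a small illustration of the `axis` argument, here is a sketch that computes row means on a hand-written array. The array `B` and the name `row_means` are made up for this example and are not part of the exercise below.

```python
import numpy as np

# Hypothetical 2 x 3 array, chosen so the means are easy to check by hand
B = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# axis=1 runs the operation along the columns, leaving one mean per row
row_means = B.mean(axis=1)
print(row_means)        # [2. 5.]
print(row_means.shape)  # (2,)
```

The result has one entry per row because the dimension given as `axis` is the one that disappears.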

**Exercise**

Get the mean and standard deviations for the columns of A:

In [None]:
# your code here


**Project Exercise 3**

Now head over to the project notebook and complete exercise 3!

## 10.2 Mean comparison using `scipy.stats`

As seen above, `numpy` provides some basic descriptive stats functionality. Inferential stats, however, are beyond the realm of `numpy`. That's not a problem, because now we'll get to know the next important package of the scientific Python stack: `scipy`. More specifically, we're going to use its `stats` module.

In [None]:
from scipy import stats

This module contains all kinds of probability distributions, but also functions for standard statistical methods: t-tests, ANOVA, univariate regression. The function for an independent-samples t-test is `stats.ttest_ind`. It expects at least two arguments, namely the data points of the two samples to compare:

```python
result = stats.ttest_ind(a, b)
```

In addition, you can pass arguments to e.g. use a permutation test instead of the canonical t-distribution to get the p-value. The output is very parsimonious: essentially only the t- and p-values are reported.
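
To make the shape of the output concrete, here is a minimal sketch on made-up data. The names `x`, `y`, `res` and `welch`, the seed and the sample sizes are assumptions for illustration only. The `equal_var=False` option shown at the end switches to Welch's t-test, which does not assume equal variances.

```python
import numpy as np
from scipy import stats

# Hypothetical samples: same spread, means two units apart
rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=50)
y = rng.normal(loc=2.0, scale=1.0, size=50)

# Standard independent-samples t-test (assumes equal variances)
res = stats.ttest_ind(x, y)
print(f"{res.statistic = :.2f}, {res.pvalue = :.3g}")

# The result can also be unpacked like a tuple
t_value, p_value = stats.ttest_ind(x, y)

# Welch's t-test: drop the equal-variance assumption
welch = stats.ttest_ind(x, y, equal_var=False)
print(f"{welch.statistic = :.2f}, {welch.pvalue = :.3g}")
```

The t-value is negative here simply because the first sample has the smaller mean; swapping the arguments flips the sign.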

In [None]:
a = np.random.randn(100) + 1
b = np.random.randn(100) + 4

**Exercise**

Check how likely it is to arrive at a mean difference that is at least as large as the one between `a` and `b`, assuming that they are drawn from the same population (i.e. compute an independent-samples t-test):

In [None]:
# your code here


**Project Exercise 4**

Time to put the next chunk of knowledge into practice. Back to the project notebook.

## 10.3 Multiple regression using `statsmodels`

There are many ways to run a multiple regression. One way is to use the `statsmodels` package, which you should have installed. It uses an object-oriented approach, i.e. we first create a linear regression object with the data to fit and then fit it in a second step. The procedure works like this:

 1. Import the correct module from the statsmodels package. The one used here works with numpy arrays. There is another module, `statsmodels.formula.api`, which is meant to be used with named arrays (such as `pandas` data frames). With that one you can specify the model with a formula, as you might know it from `R`.
 
```python
import statsmodels.api as sm
```

 2. Then you create a regression object. In this step you already specify the dependent variable and the design matrix (i.e. the independent variables). If you want to fit an intercept and your design matrix does not contain a column of ones, you can add one with `statsmodels`, although of course you already know how to do that with `numpy` alone.
 
```python
# optionally, add intercept column:
X = sm.add_constant(X)
lin_reg = sm.OLS(y, X) # dependent variable comes first!
```

 3. Now you can use the linear regression object to fit the model. This will return a detailed fit object.
 
```python
fit = lin_reg.fit()
```

 4. This fit object provides you with methods that report the model fit. E.g. the `fit.summary()` method:
 
```python
fit.summary()
```
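
For comparison, the formula-based interface mentioned in step 1 might look like the sketch below. The data frame `df`, its column names and the true coefficients are invented for this illustration. Note that the formula interface adds an intercept by default, so no column of ones is needed.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: y = 1 + 3 * x1 - 2 * x2 + small noise
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["y"] = 1.0 + 3.0 * df["x1"] - 2.0 * df["x2"] + rng.normal(scale=0.1, size=200)

# R-style formula: dependent variable on the left, predictors on the right
formula_fit = smf.ols("y ~ x1 + x2", data=df).fit()

# Estimated coefficients, indexed by name ("Intercept", "x1", "x2")
print(formula_fit.params)
```

The creation and fitting steps are the same as above; only the way the model is specified differs.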

We'll simulate some data like we did before, except that we don't include the intercept in the design matrix but just add it as a scalar. This is equivalent.

In [None]:
X = np.random.randn(100, 2)
epsilon = np.random.randn(100) * .3
betas = np.array([5, 2.4])
intercept = 2.5
y = X.dot(betas) + intercept + epsilon
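
The equivalence claimed above can be checked numerically. The following sketch uses separate, made-up names (`X_demo`, `betas_demo`, `intercept_demo`) so that it does not overwrite the simulated data, and it leaves out the noise term because the noise is the same in both formulations.

```python
import numpy as np

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 2))
betas_demo = np.array([5, 2.4])
intercept_demo = 2.5

# Variant 1: intercept added as a scalar
y_scalar = X_demo.dot(betas_demo) + intercept_demo

# Variant 2: intercept as the weight of a column of ones in the design matrix
X_ones = np.column_stack([np.ones(len(X_demo)), X_demo])
betas_full = np.concatenate([[intercept_demo], betas_demo])
y_matrix = X_ones.dot(betas_full)

print(np.allclose(y_scalar, y_matrix))  # True
```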

**Exercise**

Run a multiple regression to recover the betas. Don't forget to add the column of ones to the design matrix!

In [None]:
import statsmodels.api as sm
# your code here


**Project Exercise 5**

Next assignment! This is a bigger one. We might postpone it to make sure that there is a bit of time for matplotlib.