# Statistics with Python

by Katharina J. Hoff, contact: katharina.hoff@uni-greifswald.de

## Audience

This course is suitable for participants who already have background knowledge of fundamental statistical concepts. Previous programming experience is recommended (not necessarily in Python).

# Introduction

Python is a general-purpose, interpreted programming language. Thanks to its clear and simple syntax, it is comparatively easy to learn. Created by Guido van Rossum and first released in the early 1990s, Python has become one of the most popular programming languages today. It can be used in a wide range of fields, e.g.:

  * Scripting
  
  * Object-oriented programming
  
  * Machine learning
  
  * Statistics
  
... and probably many more. This is why we encourage you to learn Python if you have not had any programming experience yet: it is such a generally applicable language that you will be prepared for almost any future programming situation.

In this course, however, we will focus on how to perform statistical data analysis with Python. Some of you have probably had previous experience with the programming language R. You will see that, in terms of statistics, R is still the superior language in some areas. Nevertheless, switching to Python might be worthwhile: it is a very rapidly evolving programming language with increasing popularity in data science.

Python is a free and powerful programming language that is compatible with many operating systems (e.g. Linux, OS X, Windows). 

In this course **Python 3** is used.

## Usage Modes

Python offers different usage modes: interactive and non-interactive. In this course, we will only use the interactive mode.

This course uses the popular Jupyter environment. Jupyter Notebooks offer an interactive environment where we can enter commands and immediately see the results. The results of each command are held in memory until the interactive environment (the kernel) is shut down.

Try it on your own, e.g. enter <tt>1 + 1</tt> or <tt>print("Hello world!")</tt> in the empty cell below and press **Ctrl + Enter** to execute the content of the currently selected cell:


In [1]:
1 + 1

2

### Useful Keyboard Shortcuts
<br>

<font size="3">
    
| Shortcut | Function |
| -------- | ----------- |
| Esc      | Switch to command mode |
| Enter    | Switch to edit mode |
| B        | Create new empty cell **B**elow |
| H        | Show **H**elp   |
| X        | Cut (remove) the currently selected cell |
| Shift + Enter | Run cell and advance to next cell |
| Ctrl  + Enter | Run cell |
| Ctrl  + S     | Save notebook |

The frame color of the currently selected cell changes from blue in command mode to green in edit mode.

</font>

## JupyterHub

JupyterHub is a Jupyter environment running on a remote server of the university (which we're using right now). It is accessible from within the university network or remotely from home via the VPN client. Therefore, a local installation is not necessary.
<div class="alert alert-warning" role="alert">
    <b>If you're connected to eduroam, you can directly access the JupyterHub via</b>
    <a href="https://jupyterhub.wolke.uni-greifswald.de/hub/login">https://jupyterhub.wolke.uni-greifswald.de/hub/login</a> using your personal login credentials from the university data center.
</div>

Open a terminal and run the following command to clone the course materials to your Jupyter notebook instance:

`git clone https://github.com/KatharinaHoff/PythonStatistics.git`

## Local Setup

If you'd like to write and test your code independently from the university infrastructure (e.g. after the course) and start from scratch, install the Jupyter environment locally on your machine, following the instructions below.

### Linux

The easiest way is to download and install the Anaconda distribution here: <br>
https://repo.anaconda.com/archive/Anaconda3-2020.07-Linux-x86_64.sh

### Mac OS X

The easiest way is to download and install the Anaconda distribution here: <br>
https://repo.anaconda.com/archive/Anaconda3-2020.07-MacOSX-x86_64.pkg

### Windows

The easiest way is to download and install the Anaconda distribution here: <br>
https://repo.anaconda.com/archive/Anaconda3-2020.07-Windows-x86_64.exe


## Installation of Python Add-On Packages

A huge number of add-on packages for Python are available. Many are listed at https://pypi.org/. To install such a package in your local Python environment, use pip. In the following, we will try to install all packages that we will use in this course (some might already be present):

In [2]:
pip install numpy

Note: you may need to restart the kernel to use updated packages.


In [3]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [4]:
pip install itertools

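ERROR: Could not find a version that satisfies the requirement itertools (from versions: none)
ERROR: No matching distribution found for itertools
Note: you may need to restart the kernel to use updated packages.

This error is expected: <tt>itertools</tt> is part of the Python standard library, so it is always available and cannot (and need not) be installed with pip.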


In [5]:
pip install matplotlib

Note: you may need to restart the kernel to use updated packages.


In [6]:
pip install scipy

Note: you may need to restart the kernel to use updated packages.


In [7]:
pip install statsmodels

Note: you may need to restart the kernel to use updated packages.


In [8]:
pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.

Note that the package is called <tt>scikit-learn</tt> on PyPI, while it is imported as <tt>sklearn</tt> in Python code. The PyPI project named <tt>sklearn</tt> is only a deprecated placeholder and should not be installed.


In [9]:
pip install seaborn

Note: you may need to restart the kernel to use updated packages.


## Documentation and Help System

Python comes along with a built-in help system. To call help on a particular "object", try running <tt>help("objectname")</tt>:

In [10]:
help("print")

Help on built-in function print in module builtins:

print(...)
    print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)
    
    Prints the values to a stream, or to sys.stdout by default.
    Optional keyword arguments:
    file:  a file-like object (stream); defaults to the current sys.stdout.
    sep:   string inserted between values, default a space.
    end:   string appended after the last value, default a newline.
    flush: whether to forcibly flush the stream.



If the built-in help is not sufficient, we recommend using a web search engine. For example, search for "Python3 print example" or similar.

## Pocket Calculator

Python can be used as a simple pocket calculator for addition, subtraction, multiplication and division. Logarithms and similar functions are easy to compute as well:

In [11]:
print(4+7)
print(3*2)
print(6-1)
print(10/2)

11
6
5
5.0
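
The last result above is printed as 5.0 rather than 5 because the <tt>/</tt> operator always returns a float in Python 3. The following is a minimal sketch of the related built-in arithmetic operators; the variable names are chosen for illustration only:

```python
quotient = 10 / 2     # true division always returns a float
floor_div = 7 // 2    # floor division rounds down to the nearest integer
remainder = 7 % 2     # modulo: remainder of the integer division
power = 2 ** 10       # exponentiation

print(quotient, floor_div, remainder, power)  # 5.0 3 1 1024
```

Use <tt>//</tt> whenever you need an integer result, e.g. for computing an index.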


For slightly more complex tasks, you will need to import a math library. We recommend NumPy:

In [12]:
import numpy as np

In [13]:
print(np.log(2)) # natural logarithm
print(np.log10(2)) # base 10 logarithm
print(np.exp(2)) # e^2

0.6931471805599453
0.3010299956639812
7.38905609893065


If you attempt a mathematically undefined operation, NumPy will warn you and return <tt>nan</tt>, which means "not a number":

In [14]:
np.log(-1)

RuntimeWarning: invalid value encountered in log
  np.log(-1)


nan
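
Since <tt>nan</tt> values silently propagate through later calculations, it helps to know how to detect them. The sketch below suppresses the warning for the demonstration and contrasts NumPy with the standard library module <tt>math</tt>, which raises an exception instead; the variable name <tt>value</tt> is illustrative:

```python
import math
import warnings

import numpy as np

# Suppress the RuntimeWarning only inside this block
with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)
    value = np.log(-1)

print(np.isnan(value))  # True: the recommended way to test for nan
print(value == value)   # False: nan is not equal to anything, not even itself

# The standard library refuses the operation instead of returning nan
try:
    math.log(-1)
except ValueError as err:
    print("math.log raised a ValueError:", err)
```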

You can store values, e.g. numbers, in variables:

In [15]:
a = 89
b = 45
result = (a+b)**2 # **2 is to the power of 2
print(result)

17956


<font size="3"><div class="alert alert-warning"><b>Exercise 1.1:</b> <br> 
    
Compute in Python the second binomial formula $$(a-b)^{2}$$ using a = 12 and b = 7. Create the objects a and b! Save the result in a variable!
    
</div>

<b>Try it yourself:</b></font>

**Example Solution:**

In [16]:
a = 12
b = 7
c = (a-b)**2
print(c)

25


Variables are overwritten without any warning. A descriptive name reduces this risk to some extent, e.g. 
**binom_formula_of_a_b** instead of **result**. Even built-in functions can easily be overwritten by assigning a value to their name. To be safe, enter the name of interest into the Python console or a notebook cell before using it. If a function or object with this name already exists, it will be displayed, e.g.:

In [17]:
print

<function print>

In [18]:
polarbear # does not exist, feel free to use this as a new object name

NameError: name 'polarbear' is not defined

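Some good practice for choosing variable, object, function and method names:

  * do not begin with a number, and do not use dots (in Python, the dot is reserved for attribute access)
  
  * use only letters, digits and underscores; special characters such as ~, @, !, #, %, & are not allowed in names
  
  * names are case-sensitive, i.e. <tt>Result</tt> and <tt>result</tt> are two different names
  
  * choose names that make sense, i.e. have a meaning for humans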
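
These checks can also be done programmatically. The following sketch uses only the standard library; <tt>check_name</tt> is a hypothetical helper written for this illustration, not a built-in function:

```python
import builtins
import keyword


def check_name(name):
    """Return a short verdict on whether `name` is a good variable name."""
    if not name.isidentifier():
        return "invalid"             # e.g. starts with a digit or contains a dot
    if keyword.iskeyword(name):
        return "reserved keyword"    # e.g. for, if, while
    if hasattr(builtins, name):
        return "shadows a built-in"  # e.g. print, sum, list
    return "ok"


for candidate in ["result", "2fast", "binom.formula", "print", "for", "polarbear"]:
    print(candidate, "->", check_name(candidate))
```

Names reported as "shadows a built-in" are technically allowed, but assigning to them hides the original function until the kernel is restarted or the name is deleted with <tt>del</tt>.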

## Data Types

Python supports many data types, here are some of the most commonly used ones:

In [19]:
print(type(2))          # integer
print(type(5.2))        # float
print(type("hello"))    # string
print(type(True))       # boolean

<class 'int'>
<class 'float'>
<class 'str'>
<class 'bool'>


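Python infers the type of a variable from the value assigned to it, and it converts between numeric types automatically where needed (e.g. adding an integer and a float yields a float). Genuinely incompatible types, such as a string and a number, are not converted automatically and raise a <tt>TypeError</tt>. In the field of statistics, we mostly work with numeric data and thus rarely need to worry about data types.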
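
A brief sketch of how types interact in practice: numeric types are combined automatically, whereas a string and a number need an explicit conversion. The variable names are illustrative only:

```python
total = 2 + 5.2        # int + float is converted to float automatically
print(type(total))     # <class 'float'>
print(True + 1)        # 2, because booleans behave like the integers 0 and 1

# A string and a number are NOT converted automatically
try:
    "3" + 4
except TypeError as err:
    print("TypeError:", err)

# Explicit conversion makes the intention clear
number = int("3") + 4  # convert the string to an integer, then add
text = "3" + str(4)    # convert the integer to a string, then concatenate
print(number, text)    # 7 34
```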

## Data Structures

Some of the most fundamental built-in data structures are tuples (immutable), lists (mutable) and dicts (key-value, mutable):

In [20]:
my_tuple = (5,6,1)  # tuples are initialized with round brackets
print(my_tuple)
my_tuple[0] = 2     # tuples are immutable, the attempt to change the first value will fail

(5, 6, 1)


TypeError: 'tuple' object does not support item assignment

In [21]:
my_list = ["a", "b", 1] # lists are initialized with square brackets
print(my_list)
my_list[0] = "z" # lists are mutable
print(my_list) 

['a', 'b', 1]
['z', 'b', 1]


In [22]:
my_dict = {"key1" : "value 1", "key2" : 3, 84 : True} # dicts are initialized with curly brackets
print(my_dict)
my_dict["key4"] = False # dicts are mutable
print(my_dict)

{'key1': 'value 1', 'key2': 3, 84: True}
{'key1': 'value 1', 'key2': 3, 84: True, 'key4': False}
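
Beyond creation and assignment, a few operations on lists and dicts come up constantly. The following sketch shows some of them; the variable names and the example values (a few plant heights and the group sizes of a small experiment) are assumptions made for illustration:

```python
heights = [10.0, 13.2, 19.8]
heights.append(19.3)               # lists can grow
print(len(heights), heights[-1])   # 4 19.3 (negative indices count from the end)

group_sizes = {"control": 8, "ancy": 7}
print(group_sizes["ancy"])         # 7, access a value by its key
print(group_sizes.get("other", 0)) # 0, .get() returns a default instead of raising a KeyError

# Iterate over all key-value pairs
for group, n in group_sizes.items():
    print(group, n)
```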


Numpy, the library that allows us to perform many math operations, comes along with an array data structure:

In [23]:
A = np.array([[1, 2, 3],[4, 5, 6]]) # a 2x3 matrix = 2-dim array
print(A)
A[1,2] = 10 # mutating second row (index 1) and third column (index 2)
print(A)

[[1 2 3]
 [4 5 6]]
[[ 1  2  3]
 [ 4  5 10]]
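
The main advantage of arrays over nested lists is that operations apply to all elements at once. A minimal sketch, re-creating the original 2x3 matrix from above:

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])

print(A.shape)         # (2, 3): two rows, three columns
print(A * 2)           # element-wise multiplication
print(A.sum(axis=0))   # column sums: [5 7 9]
print(A.mean(axis=1))  # row means: [2. 5.]
```

The <tt>axis</tt> argument names the dimension that is collapsed: <tt>axis=0</tt> aggregates over rows (one result per column), <tt>axis=1</tt> aggregates over columns (one result per row).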


For statistics, the pandas data frame is an essential data structure. It allows us to easily import tables into Python. To use pandas, you first have to import it:

In [24]:
import pandas as pd

In [25]:
melon = pd.read_csv('data/melon.csv', sep='\t')
melon

Unnamed: 0,variety,yield
0,A,25.12
1,A,17.25
2,A,26.42
3,A,16.08
4,A,22.15
5,A,15.92
6,B,40.25
7,B,35.25
8,B,31.98
9,B,36.52
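
The column <tt>Unnamed: 0</tt> in the output above appears because the first column of the file has no header; it most likely holds row numbers. The sketch below reproduces this with a small in-memory table (the text in <tt>raw</tt> is an assumption that mimics the first rows of the file) and shows how the <tt>index_col</tt> parameter of <tt>read_csv</tt> turns that column into the row index:

```python
import io

import pandas as pd

# Tab-separated text with an unnamed first column, similar to the melon file
raw = "\tvariety\tyield\n0\tA\t25.12\n1\tA\t17.25\n2\tA\t26.42\n"

plain = pd.read_csv(io.StringIO(raw), sep="\t")
indexed = pd.read_csv(io.StringIO(raw), sep="\t", index_col=0)

print(plain.columns.tolist())    # ['Unnamed: 0', 'variety', 'yield']
print(indexed.columns.tolist())  # ['variety', 'yield']
print(indexed)
```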


<font size="3"><div class="alert alert-warning"><b>Exercise 1.2:</b> <br> 

*Brassica campestris*, also known as "Wisconsin Fast Plant", has a rapid growth cycle. Therefore, this
model plant is particularly suitable for experiments that determine factors which influence plant growth. In a study, seven plants were treated with ancymidol (<tt>ancy</tt>), and eight plants served as <tt>control</tt> and were treated with water instead. Ancymidol is a growth inhibitor and is used as a herbicide. The height of all plants was measured after 14 days (cm). Part of the data set is contained in <tt>data/brassica.csv</tt>. Import the data into a pandas data frame <tt>brassica</tt>.
    
</div>

<b>Try it yourself:</b></font>

**Example Solution:**

In [26]:
brassica = pd.read_csv('data/brassica.csv', sep='\t')
brassica

Unnamed: 0,group,height
0,control,10.0
1,control,13.2
2,control,19.8
3,control,19.3
4,control,21.2
5,control,13.9
6,control,20.3
7,control,9.6
8,ancy,13.2
9,ancy,19.5


Just as easily as importing data sets, you can export pandas data frames to CSV format:

In [27]:
melon.to_csv(r'data/melon2.csv', index = False, sep = "\t", header = True)
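
To convince yourself that an export is complete, you can read the file back and compare it with the original. The following sketch does this in a temporary directory so that no course files are touched; the data frame <tt>demo</tt> and the file name <tt>melon_demo.csv</tt> are made up for this illustration:

```python
import os
import tempfile

import pandas as pd

demo = pd.DataFrame({"variety": ["A", "A", "B"], "yield": [25.12, 17.25, 40.25]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "melon_demo.csv")
    # index=False prevents pandas from writing the row index as an extra column
    demo.to_csv(path, index=False, sep="\t")
    restored = pd.read_csv(path, sep="\t")

# Raises an AssertionError if the two data frames differ
pd.testing.assert_frame_equal(restored, demo)
print(restored)
```

Without <tt>index=False</tt>, the row index is written as an additional column without a header, which then shows up as <tt>Unnamed: 0</tt> when the file is read again.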

## Useful Functions for Generating Data

Generating a sequence of numbers:

In [28]:
sequence = np.arange(0, 10, 1)  # start (included): 0, stop (excluded): 10, step:1
print(sequence)

[0 1 2 3 4 5 6 7 8 9]
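
Two common variations, sketched with illustrative variable names: a step size other than 1, and <tt>np.linspace</tt>, which takes the number of points instead of a step size and includes the end point:

```python
import numpy as np

evens = np.arange(0, 10, 2)   # step of 2, stop value is still excluded
print(evens)                  # [0 2 4 6 8]

grid = np.linspace(0, 1, 5)   # 5 evenly spaced values, both end points included
print(grid)                   # 0, 0.25, 0.5, 0.75 and 1 as floats
```

<tt>np.linspace</tt> is usually the safer choice for non-integer steps, because the number of elements does not depend on floating point rounding.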


Repeating array elements:

In [29]:
rep = np.repeat(3, 4)
print(rep)

[3 3 3 3]


In [30]:
rep2 = np.repeat(np.array(["A", "B"]), 2)
print(rep2)

['A' 'A' 'B' 'B']
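
<tt>np.repeat</tt> repeats each element in place. If you want to repeat the whole sequence instead, or use a different number of repetitions per element, the following sketch shows two options (variable names are illustrative):

```python
import numpy as np

labels = np.array(["A", "B"])

tiled = np.tile(labels, 2)           # repeat the whole sequence
print(tiled)                         # ['A' 'B' 'A' 'B']

uneven = np.repeat(labels, [3, 1])   # one repetition count per element
print(uneven)                        # ['A' 'A' 'A' 'B']
```

The second form is handy for building group labels for unbalanced designs.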


<font size="3"><div class="alert alert-warning"><b>Exercise 1.3:</b> <br> 
    
Create an array with the content "1,1,2,2,3,3,4,4,5,5" in Python. Do not type all numbers twice!
    
</div>

<b>Try it yourself:</b></font>

**Example Solution:**

In [31]:
Y = np.repeat(np.array([1,2,3,4,5]), 2)
print(Y)

[1 1 2 2 3 3 4 4 5 5]


<font size="3"><div class="alert alert-warning"><b>Exercise 1.4:</b> <br> 
    
Create an object containing the reverse running numbers from 28 to 50!

    
</div>

<b>Try it yourself:</b></font>

**Example Solution:**

In [32]:
Z = np.arange(50, 27, -1)  # start (included): 50, stop (excluded): 27, step: -1
print(Z)

[50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28]


## Subsetting Data Structures

In general, you can subset a single element by using square brackets (you have already seen some examples above). There are two main things to keep in mind:

  * The index of elements is zero-based, i.e. the first element of a data structure has the index 0.
  
  * Rows come before columns (in terms of dimensions, e.g. for a 2D matrix or a data frame).
  
In the following, we will look in particular at how to access data in pandas data frames, using the melon data set as an example.

In [33]:
print("access lines 1 and 2, all columns:") # line index is 0-based
print(melon.iloc[1:3,:])
print("access a particular single value as scalar:")
print(melon.iloc[1,1])
print("access a column by name:")
print(melon["variety"])
print("find parts of dataframe where yield >37:")
print(melon[melon['yield'] > 37])
print(melon[melon['variety'] == 'D'])  # all rows of variety D

access lines 1 and 2, all columns:
  variety  yield
1       A  17.25
2       A  26.42
access a particular single value as scalar:
17.25
access a column by name:
0     A
1     A
2     A
3     A
4     A
5     A
6     B
7     B
8     B
9     B
10    B
11    B
12    C
13    C
14    C
15    C
16    C
17    C
18    D
19    D
20    D
21    D
22    D
23    D
Name: variety, dtype: object
find parts of dataframe where yield >37:
   variety  yield
6        B  40.25
10       B  43.32
11       B  37.10
   variety  yield
18       D  28.55
19       D  28.05
20       D  33.20
21       D  31.68
22       D  30.32
23       D  27.58
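
Two details of the example above are easy to miss: <tt>iloc</tt> selects by position and excludes the end of a slice, whereas <tt>loc</tt> selects by label and includes it. Conditions can also be combined. The sketch below uses a small stand-in data frame, <tt>melon_demo</tt>, built from a few values of the melon data set, so that it runs on its own:

```python
import pandas as pd

melon_demo = pd.DataFrame({
    "variety": ["A", "A", "B", "B", "D", "D"],
    "yield": [25.12, 17.25, 40.25, 35.25, 28.55, 28.05],
})

by_position = melon_demo.iloc[1:3, 1]    # positions 1 and 2, end excluded
by_label = melon_demo.loc[1:2, "yield"]  # labels 1 to 2, end included
print(by_position.tolist() == by_label.tolist())  # True

# Combine conditions with & (and) or | (or); each condition needs parentheses
mask = (melon_demo["variety"] == "B") & (melon_demo["yield"] > 37)
print(melon_demo[mask])

# Mean yield per variety
print(melon_demo.groupby("variety")["yield"].mean())
```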


Recommended for further reading:

  * https://pandas.pydata.org/ 
  
  * https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

<font size="3"><div class="alert alert-warning"><b>Exercise 1.5:</b> <br> 
    
Create an array named X with values "3,4,-5,7,8,12,10,4,-3". What happens if you apply the following command: <tt>X[X<0]</tt>
    
</div>

<b>Try it yourself:</b></font>

**Example Solution:**

In [34]:
X = np.array([3,4,-5,7,8,12,10,4,-3])
print(X[X<0])

[-5 -3]


<font size="3"><div class="alert alert-warning"><b>Exercise 1.6:</b> <br> 

Show the following 'subsets' of data:
    
  [a] The third element of the array <tt>X = np.repeat(np.arange(1,6,1), 2)</tt>
    
  [b] The second row (index 1) of array <tt>mat = np.arange(1,13).reshape((3, 4))</tt>. Hint: if you want to indicate an entire row or column, use the colon operator (:).
    
  [c] The first column in the above array.
</div>

<b>Try it yourself:</b></font>

**Example Solution:**

In [35]:
# a
X = np.repeat(np.arange(1,6,1), 2)
print(X)
print(X[2])

# b
mat = np.arange(1,13).reshape((3, 4))
print(mat)
print(mat[1,:])

# c
print(mat[:,0])

[1 1 2 2 3 3 4 4 5 5]
2
[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]
[5 6 7 8]
[1 5 9]
