# Python for data analysis

Python is a powerful, easy-to-use dynamic language with a huge ecosystem of useful packages:
+ **NumPy** for manipulation of homogeneous array-based data
+ **Pandas** for manipulation of heterogeneous and labeled data
+ **SciPy** for common scientific computing tasks
+ **Matplotlib** for publication-quality visualizations
+ **IPython** for interactive execution and sharing of code
+ **Scikit-Learn** for machine learning
+ **Jupyter** for interactive notebooks, a possible alternative to spreadsheets
+ many others ...

This notebook introduces Python itself. It can only touch on some of the more advanced features of the language, but should be sufficient as a base from which to start exploring this ecosystem. Notebooks introducing NumPy, Pandas and Matplotlib will follow.

## Anaconda ##

[Anaconda](https://www.anaconda.com/download/) is a popular distribution for data science. It contains Python, conda (a package manager) and many packages, including the ones mentioned [earlier](#Python-for-data-analysis).

If you are strapped for space, get [Miniconda](https://conda.io/miniconda.html): this only installs Python and the conda package manager. You can then use the latter to install whatever else you may need for your specific projects. 


### Python 2 or Python 3? ###

As is still usual in the Python ecosystem in general, Anaconda offers a choice of Python 2.x and Python 3.x versions. The default should be to go for Python 3, as time is starting to [run out](https://pythonclock.org/) for Python 2.

Most packages have been converted from 2 to 3 anyway: [PyReadiness checklist](http://py3readiness.org/)
        
For more history and discussion on this, see [here](https://wiki.python.org/moin/Python2orPython3).

### Hello world
[Hello world](https://helloworldcollection.github.io/#Python%C2%A02)

# Python
Python is both an interpreter and a language. The interpreter executes statements or evaluates expressions written in Python. The interpreter is an executable that runs on all familiar (and most unfamiliar) platforms. It is also the engine (kernel) that executes the code in this Jupyter notebook. 

To start the interpreter, type `python` in a terminal or command prompt window. This will give you a Python prompt (>>>). Enter a Python statement or expression and hit return. Python will return a result and wait for your next move. To quit the interpreter, type `quit()`.

![alt text](figs/pythonprompt.jpg)


Let's do some calculations:

![alt text](figs/pythonpromptlater.jpg)


# IPython
If you have installed Anaconda, you can also start IPython. This adds a number of useful enhancements to your basic Python interpreter. 

![alt text](figs/ipython.jpg)

As you can see, it numbers your inputs and outputs. This makes it possible to reuse these later on. But IPython offers much more magic, as we will see during this course.

IPython is the kernel being used to execute the Python code we can enter in code cells in notebooks like this one.

In [2]:
print("Hello world")

Hello world


# Spyder 
IPython is also used by Spyder, the Integrated Development Environment (IDE) included in Anaconda. We will stick to Spyder in this course, but be aware that there are many (excellent) alternative IDEs for Python development on offer:
+ IDLE (comes with the Python distributions you can get at [Python.org](https://python.org))
+ [Eclipse](https://eclipse.org) with the [PyDev](http://www.pydev.org/) plugin (free, open source)
+ [PyCharm](https://www.jetbrains.com/pycharm/) (free Community Edition)
+ your preferred text editor 
+ etc.

# Python code
Python code consists of (blocks of) statements and expressions. 

Statements are commands that have an effect: they are *executed* and do not have a value. 

Expressions are *evaluated*: they can be reduced to a value, which can be any Python object. Expressions only contain identifiers, literals and operators, such as arithmetic and boolean operators, the function call operator () and the subscription operator []. Expressions cannot contain statements.

Statements and expressions can be entered interactively, or collected in a file and entered as a batch by either:
+ starting the interpreter with the file as an argument ($ `python myfile.py`)
+ loading and executing them from within Python, e.g. by using **`import myfile`**


# Layout matters! 
The layout, or at least the indentation, of your code is all-important in Python. While other languages generally use some type of bracketing to structure code (i.e. to disambiguate what belongs together in the same block of code), Python uses indentation to structure code. 

So instead of say `{x, y, {a,b, {c, d}, e}, z}`, which could just as well be written as say 

`{x, 
y, {a,b, {c, 
d}, e}, z}`

Python requires a layout like this:

`
x 
y
   a
   b 
       c
       d
   e
z`

Statements or expressions with the same indentation belong to the same block. The amount of whitespace is immaterial and may vary between blocks, as long as the same indentation is used within a block. The convention is to use four spaces per level (do not use tabs).
 
In other words, white space matters, but only in so far as it affects indentation: within a single statement, e.g. a list of items, the amount of whitespace separating its components is irrelevant.
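
As a minimal sketch of how indentation delimits blocks, consider the loop below. The names `total`, `last_seen` and `items` are made up for this illustration, and the `for` and `if` statements are only used here to have something to nest.

```python
# Each level of indentation opens a new block; dedenting closes it.
total = 0
for number in [1, 2, 3, 4]:
    if number % 2 == 0:
        # This line belongs to the `if` block.
        total += number
    # Back at the `for` level: runs for every number.
    last_seen = number
# Back at the top level: runs once, after the loop has finished.
print(total, last_seen)

# Inside a single statement the amount of whitespace is irrelevant.
items = [1,    2,
             3]
print(items)
```

Moving the `last_seen = number` line one level deeper would make it part of the `if` block, so it would only run for even numbers.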

To recap, an expression such as `1+55` has a value: 

In [4]:
1+55

56

while a statement such as `a = 3` has no value, but an effect:

In [5]:
a = 3

This assignment causes a new name, in this case *a*, to be added to the list of known names (the *namespace*), bound to the value of the right operand of the assignment operator **`=`** (not to be confused with the **`==`** equality operator). We can now use this name as a (bound) variable in expressions:

In [6]:
a

3

Unbound names cannot be used: the interpreter will raise a NameError and execution will stop.

In [18]:
b

NameError: name 'b' is not defined

Once both `a` and `b` are bound, we can use them in expressions such as `a+b` or `a,b`. The latter uses the comma as a tuple operator.

Note the use of the semicolon to separate different statements on the same line (the convention is to stick to one statement per line).

In [1]:
b = 4; print(a,b)


3 4

In [20]:
b = 4; a,b

(3, 4)

While statements have effects and no value, expressions have a value, and may have a side effect. In Python 2 print is a statement: `print 9`. In Python 3 print is a function: `print(9)`. It returns the None object, the object used in Python to represent *nothing* (*null, nil, void, ...* in other languages).

In [8]:
a=print(9)

9


In [9]:
a
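
The last cell shows no output because the interactive prompt does not echo `None`. A short sketch that makes the returned value visible (the name `result` is chosen here for illustration):

```python
# print() is called for its side effect; its return value is None.
result = print(9)

# An identity test confirms that the name is bound to the None object.
print(result is None)

# repr() gives a printable representation, so None becomes visible.
print(repr(result))
```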

# Datatypes
Programs may use bits to represent all sorts of data (such as numbers, strings, lists, tables, records and many others). To keep track of which bits stand for what data (the memory requirements of a particular type) and to know what functions are meaningful on some piece of data, we need to specify the data type. In a language such as C, variables (names) can only refer to a specific type of data: this type needs to be declared before a variable can be used. A variable of type int can only have integers as its value (with some maximum), floats can only be assigned to variables of type float, and so on.

In Python the type is stored with the data, not with the variable. This means that all data in Python share a number of attributes, however different the actual data itself may be.

## Everything is an object

Numbers, strings, function definitions, whatever users of Python may create, they always have:
+ a **type**, retrievable by **type**(*obj*) 
+ a **unique id**, retrievable by **id**(*obj*) 
+ a **value**

This value can simply be a number, it can be a resource such as a file or window, or it can be a collection of references to other objects. 
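
As a small sketch, the loop below asks a few very different objects for their type and id. The selection of objects is arbitrary, and the ids printed will differ from run to run.

```python
# Every object has a type and an id, whatever its value is.
for obj in (42, "text", [1, 2], len):
    print(type(obj).__name__, id(obj))

# type() returns the class object itself, which can be compared directly.
number_type = type(42)
text_type = type("text")
print(number_type is int, text_type is str)
```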

Types are thus a property of a data object, not of a variable. A variable is just a name, and can refer to any object, whatever the type of that object. It can be reassigned at runtime to any other object. It can even be unbound, scrapped from the record so to speak, by using the del statement: `del name`. 

The type and id of an object can never be changed. The value of some objects (e.g. numbers and strings) can also never be changed. Such types are **immutable**. The value of **mutable** types, such as list, set or dictionary, can be changed (items can be added, removed or updated). Note: the mutability of objects in a collection type is determined by the type of the object, not of its container, so even if collections such as tuples are themselves immutable, their elements might well still be mutable. 

In [15]:
a = "fff"; b = a; a= 2; b, a

('fff', 2)
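
The difference between rebinding a name and changing a mutable object can be sketched as follows. All names are invented for this example.

```python
# Two names bound to the same mutable list see the same changes.
first = [1, 2]
second = first
second.append(3)
print(first, first is second)      # [1, 2, 3] True

# A tuple is immutable, but it may contain a mutable element.
pair = ("fixed", [])
pair[1].append("added")
print(pair)                        # ('fixed', ['added'])

# Replacing an element of the tuple itself is not allowed.
try:
    pair[0] = "changed"
except TypeError as error:
    print("TypeError:", error)

# del unbinds a name; the object lives on under its other names.
del second
print(first)                       # [1, 2, 3]
```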

### Even functions are objects

In [11]:
print(type(print), id(print))
write = print
print(type(write), id(write))
write("hello world")
del write

<class 'builtin_function_or_method'> 2169292554104
<class 'builtin_function_or_method'> 2169292554104
hello world


# Types and classes
In Python3 the terms *type* and *class* are used interchangeably. Python always had types. Classes were added to allow custom types to be added. Built-in types and custom classes are treated similarly. We will look into this in more detail later on, for now it suffices to note that classes (types) define what operations (functions, methods) are supported by objects of that type. 

Adding two numbers is not the same as adding two vectors or adding two strings. We could of course define an *integer-add* and a *vector-add* and a *string-add*, and leave it to the users of these to apply the right function to the right type of objects. By defining these various add functions as class methods, we immediately gain an advantage in naming: we can now refer to *string.add* and *vector.add* and *integer.add*. In other words we can reuse the name *add* and stick the name of the class in front to disambiguate these methods.

Easier and less error-prone is to let the system figure out itself which particular add method should be used to add two particular objects. Python has implemented such *dispatching on the first argument* by allowing us to say **x.add(y)**. 

Depending on type(x), this would be equivalent to: int.add(x,y), or str.add(x,y) or vector.add(x,y).

Note: this is just to illustrate the idea. For these built-in types things work slightly differently; we will come back to this later. For now it is important to understand the notations used: "hhh".upper() is equivalent to str.upper("hhh")

In [13]:
"hhh".upper() == str.upper("hhh")

True

In [3]:
x = 4.5
float.is_integer(x)

False
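
To make the *x.add(y)* idea concrete, here is a minimal sketch with a hypothetical `Vector` class. The class and its `add` method are made up for this illustration; classes are treated properly later on.

```python
class Vector:
    """Hypothetical two-dimensional vector, for illustration only."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def add(self, other):
        # Component-wise addition, returning a new Vector.
        return Vector(self.x + other.x, self.y + other.y)


v = Vector(1, 2)
w = Vector(3, 4)

# The two spellings call the same method.
s1 = v.add(w)
s2 = Vector.add(v, w)
print((s1.x, s1.y), (s2.x, s2.y))                   # (4, 6) (4, 6)

# Built-in types use the special method name __add__ rather than add.
print(int.__add__(3, 4), str.__add__("ab", "cd"))   # 7 abcd
```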

## Dot notation
The dot notation is used not only to access methods, but also attributes: **obj.att** and **obj.method()**

This is even true of simple objects like numbers:

In [2]:
x = 4.5
print(x.real, x.imag)
4.0.is_integer()

4.5 0.0


True

# Basic types
Python comes with a number of standard types:
+ numerics
+ sequences
+ sets
+ mappings
+ callables
+ classes (custom types, i.e. types that you can add to the built-in types)
+ instances of classes
+ I/O (file) objects
+ modules 
+ a few single value types (None, Ellipsis and NotImplemented)
+ internal types

## Comparisons

The Python language provides a number of operators. Types differ in their support for these operators. The following table summarizes the most generally supported type of operator, the comparison operator. While not all of these are supported by all basic types, many of them are. Furthermore, as we will see, most of these operators can be customized for custom classes.

The `is` and `is not` operators are special in this regard. These two operators can be applied to any two objects and will always be supported, as they simply compare the ids of their operands. They cannot be customized.

|Operation 	|Meaning|
|:----|----:|
|`<` 	|strictly less than|
|`<=` |	less than or equal|
|`>` |	strictly greater than|
|`>=` |	greater than or equal|
|`==` |	equal|
|`!=` |	not equal|
|`is `|	object identity|
|`is not`|	negated object identity|
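
The difference between equality and identity is easy to overlook. A short sketch, using lists invented for the purpose:

```python
first = [1, 2, 3]
second = [1, 2, 3]
alias = first

print(first == second)      # True: same value
print(first is second)      # False: two distinct objects
print(first is alias)       # True: two names for one object
print(first is not second)  # True
print(first < [1, 2, 4])    # True: lists compare element by element
```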

<!--NAVIGATION-->
[Off to the Whirlwind notebooks](Whirlwind/04-Semantics-Operators.ipynb)

# Numerics
## bool
This is the boolean type, with just two instances: True and False. The standard boolean operators **`and`**, **`or`** and **`not`** are called just that in Python. For historic and pragmatic reasons, this type is a subtype of int, with False acting as 0 and True as 1 in arithmetic operations, so True + 4 is legal and equals 5.

Having a bool type does not mean that False is the only object that will be interpreted as false. Other "falsy" values, besides 0, include:
+ an empty set set()
+ an empty dictionary {}
+ an empty string ""

While built-in operations that are meant to return a boolean will generally return a bool, `and` and `or` may not, given that these return the first argument that determines their outcome. 

In [None]:
True + 6, 12 or None

In [None]:
bool(-3), bool(0), bool(False), bool("False"), bool("")
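
The rules above can be checked with a short sketch. The list `falsy` is just a convenient collection of examples, not an exhaustive list.

```python
# Objects that count as false in a boolean context.
falsy = [False, 0, 0.0, "", set(), {}, None]
print([bool(value) for value in falsy])

# A non-empty string is true, even if its text reads "False".
print(bool("False"), bool(-3))

# `or` returns the first true operand, or else the last operand.
print(repr(12 or None), repr(0 or ""), repr(None or 0))

# `and` returns the first false operand, or else the last operand.
print(repr(0 and 12), repr(3 and 12))

# bool is a subtype of int, so True behaves as 1 in arithmetic.
print(True + 6, isinstance(True, int))
```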

# Numeric types: int, float, complex
Integers have unlimited precision in Python 3 (in Python 2, int had at least 32 bits of precision and a separate long type had unlimited precision). Floats are usually implemented using the C double type.

In [21]:
19**50

8663234049605954426644038200675212212900743262211018069459689001

In [87]:
import sys
print(sys.float_info)
print("0.1 = {0:.17f}".format(0.1))
print("0.2 = {0:.17f}".format(0.2))
print("0.3 = {0:.17f}".format(0.3))
print(0.1 + 0.2 + 0.3 == 0.6)
print("0.6 = {0:.17f}".format(0.6))
print("0.6 = {0:.17f}".format(0.1 + 0.2 + 0.3))

sys.float_info(max=1.7976931348623157e+308, max_exp=1024, max_10_exp=308, min=2.2250738585072014e-308, min_exp=-1021, min_10_exp=-307, dig=15, mant_dig=53, epsilon=2.220446049250313e-16, radix=2, rounds=1)
0.1 = 0.10000000000000001
0.2 = 0.20000000000000001
0.3 = 0.29999999999999999
False
0.6 = 0.59999999999999998
0.6 = 0.60000000000000009
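
Because of these rounding effects, floats are usually compared with a tolerance rather than with `==`. One possible approach, sketched here with `math.isclose` from the standard library:

```python
import math
import sys

total = 0.1 + 0.2 + 0.3

# Exact comparison fails because of binary rounding.
print(total == 0.6)                 # False

# Compare with a (relative) tolerance instead.
print(math.isclose(total, 0.6))     # True

# The difference is no larger than the machine epsilon shown above.
print(abs(total - 0.6) <= sys.float_info.epsilon)
```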


All numeric types support the following operations:
![alt text](arithops.jpg)

![alt text](arithadditional.png)

Notice that for some operations you need to import the **math** module.

## Numeric literals
Integer literals can be in decimal, binary, octal and hexadecimal format:

In [None]:
oct(28), hex(28), bin(28)
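
The literal notations themselves use the prefixes `0b`, `0o` and `0x`. A small sketch writing the number 28 in all four ways (the variable names are only for illustration):

```python
# The same integer, 28, written in four notations.
decimal = 28
binary = 0b11100
octal = 0o34
hexadecimal = 0x1c
print(decimal == binary == octal == hexadecimal)   # True

# oct(), hex() and bin() go the other way and return strings.
print(oct(28), hex(28), bin(28))                   # 0o34 0x1c 0b11100
```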

In [89]:
int('34'), int('34',16)

(34, 52)

### To manipulate binary numbers:

![Bitwise operators](bitwise%20ops.png)

In [78]:
~0 ^ 10

-11
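
A few more examples may help to read the bitwise operators. The operands `a` and `b` are arbitrary values chosen so that the bit patterns are easy to follow.

```python
a = 0b1100   # 12
b = 0b1010   # 10

print(bin(a & b))   # 0b1000  bitwise and
print(bin(a | b))   # 0b1110  bitwise or
print(bin(a ^ b))   # 0b110   bitwise exclusive or
print(a << 2)       # 48      shift left by two bits
print(a >> 2)       # 3       shift right by two bits
print(~a)           # -13     inversion, equal to -a - 1
```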

### Conversions
An integer can be converted to a float and vice versa. As we will see later on, these int and float functions are not converters, but constructors of objects of class int or float. The numeric values are just arguments to these constructors. 

In [83]:
(id(9), id(float(9)), id(int(float(9))))

(1590160080, 1718896961096, 1590160080)
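
A brief sketch of what these constructors do with their arguments. Note in particular that `int` truncates rather than rounds.

```python
# int() truncates towards zero; it does not round.
print(int(3.9), int(-3.9))        # 3 -3

# round() rounds to the nearest integer instead.
print(round(3.9))                 # 4

# float() builds a new float object from an int or a string.
print(float(9), float("2.5"))     # 9.0 2.5

# The results are ordinary instances of the int and float classes.
print(type(int(3.9)), type(float(9)))
```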

## Floats
Float literals can be entered using decimal or scientific notation:

In [90]:
44.789, 0.44789e2, 447.89E-1

(44.789, 44.789, 44.789)

## Complex numbers
Imaginary numbers are marked using the j suffix.

In [96]:
a = complex(5,4)
b = 9+3j
c = a.real + b.imag
d = complex(a.real, b.imag)
print(a,b,c,d)

(5+4j) (9+3j) 8.0 (5+3j)
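
Complex numbers also support the usual arithmetic. A small sketch, with `z` chosen so that the results are round numbers:

```python
z = 3 + 4j

print(z.real, z.imag)     # 3.0 4.0
print(abs(z))             # 5.0, the modulus
print(z.conjugate())      # (3-4j)
print(z * 1j)             # (-4+3j), multiplying by j rotates by 90 degrees
```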



# Sequence types
+ **list**: a mutable sequence of objects of any type
+ **tuple**: an immutable sequence of objects of any type
+ **range**: an immutable range of numbers
+ **str**: a text string, i.e. an immutable sequence of characters
+ **bytes / bytearray**: immutable / mutable sequence of bytes

### All sequence types support:
![Common sequence operations](commonseqops.png)
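
The sketch below tries a few common operations on three different sequence types. It assumes the table lists the standard operations (length, indexing, slicing, membership, concatenation, repetition, `index` and `count`). Concatenation and repetition are shown for list and str only, since range does not support them.

```python
letters = ["a", "b", "c", "d"]     # list
word = "abcd"                      # str
numbers = range(4)                 # range

for seq in (letters, word, numbers):
    # Length and indexing work the same for every sequence type.
    print(len(seq), seq[0], seq[-1])

print("b" in letters, "bc" in word, 7 in numbers)   # True True False
print(word[1:3], letters[::2])                      # bc ['a', 'c']
print(letters + ["e"], word * 2)                    # ['a', 'b', 'c', 'd', 'e'] abcdabcd
print(word.index("c"), letters.count("a"))          # 2 1
```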
