# Lab 1: Data Science in Python

This lab will introduce you to some basic tools and libraries commonly used for Data Science in Python.
<ul>
<li>```numpy``` is a library that efficiently stores matrices and vectors, called arrays. Some really clever people optimised it very well, so that it can do things such as dot products and other matrix operations very efficiently.
<li>```pandas``` offers some more R-like dataframes. A dataframe is more flexible than an array (for example, it can have strings and numbers in the same column).
<li>```matplotlib``` is a plotting library.
<li>```seaborn``` is built on top of matplotlib, and offers a few additional perks in the same style as ```matplotlib```.
</ul>

## 1. Jupyter
Jupyter notebooks are interactive Python notebooks, in which you can design your code and run each cell one by one. There are code cells and markdown cells. You can select the type of cell in the menu above: Cell > Cell Type. Use markdown cells to write notes for your future self or your collaborators/readers. Code cells run your code when you hit ```Ctrl + Enter```. In <b>command mode</b>, you can navigate through cells using the up/down arrows. With ```Enter```/double click, you get into <b>edit mode</b>, where you can modify the contents of a cell.
<ul>
<li>Create a new cell in command mode by hitting ```a``` (new cell above) or ```b``` (new cell below).

<li>Delete a cell in command mode by hitting ```d d```.
</ul>
A very neat feature of Jupyter notebooks is that the value of the last expression in a code cell is automatically displayed below the cell, even without the ```print()``` function.

<ol>
<li>Try creating a new cell below (by default, this creates a new code cell), and put ```34 * 567``` in there. Execute the cell to see what happens. 

<li>Then, put ```34 / 567``` in there, below your previous line of code. What gets displayed? Two numbers, or just one? Which one?
</ol>

In [50]:
34*567
34/567

0.059964726631393295
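Only the value of the last line is displayed automatically. If you want to see both results, one option is to print each value explicitly. A minimal sketch (the variable names ```product``` and ```ratio``` are just illustrative):

```python
# Only the last expression in a cell is echoed automatically,
# so print each value explicitly to see both.
product = 34 * 567
ratio = 34 / 567
print(product)  # 19278
print(ratio)
```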

Now, let's get started. First, import these libraries. This is the standard way in which many programs import them:

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## 2. Numpy

Numpy is a library for storing and fast computing with matrices and vectors. The basic data type in numpy is a numpy array (```np.array()```). An array can be one-dimensional (vector), two-dimensional (matrix) or higher-dimensional ("tensor"; like a matrix of matrices, etc. - but don't worry about it too much for now).

A one-dimensional array is a bit like a list in Python, except that it has a fixed size that has to be defined when the array is created. It also has a fixed datatype, e.g. ```np.int64``` (integers) or ```np.float64``` (floating point numbers). All elements of the array must be of the same datatype; if you try to insert an item of a different datatype, numpy will either convert it to the array's datatype or raise an error. Here are a few ways of creating new arrays. Try to execute the cells below and figure out how these methods work.

You can also find more documentation on numpy <a href="http://www.numpy.org/">here</a>.
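To make the "fixed datatype" idea a bit more concrete, here is a small sketch of what happens when the types don't match. The array names ```ints``` and ```floats``` are made up for illustration, and the exact integer type (e.g. ```int64```) may depend on your platform.

```python
import numpy as np

ints = np.array([1, 2, 3])       # all integers, so numpy picks an integer dtype
floats = np.array([1, 2.5, 3])   # one float makes the whole array float64
print(ints.dtype, floats.dtype)

ints[0] = 7.9                    # the float is truncated to fit the integer dtype
print(ints)                      # [7 2 3]

try:
    ints[1] = "hello"            # a string cannot be converted to an integer
except ValueError as err:
    print("ValueError:", err)
```

Note that the first assignment does not fail; the value is silently converted, which can be an easy source of bugs.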

In [6]:
np.array([1,2,3,9,8,7])

array([1, 2, 3, 9, 8, 7])

In [7]:
np.arange(6)

array([0, 1, 2, 3, 4, 5])

With ```np.array()```, you can turn a list or iterable into a numpy array. You need to specify the items by hand. ```np.arange``` automatically produces values within a specified range. Try creating an array that has all the odd numbers between 10 and 20. Check out <a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.arange.html">the documentation</a> to help you.

In [51]:
#Code here
np.arange(11,20,2)

array([11, 13, 15, 17, 19])

Now run the cells below.

In [4]:
np.zeros((5,))

array([0., 0., 0., 0., 0.])

In [11]:
np.empty((5,))

array([0.00000000e+000, 2.31859903e-052, 3.60345902e+175, 1.17475197e+165,
       1.60372000e-051])

What's the difference between ```np.zeros``` and ```np.empty```? (You might not see any difference at first. If not, try executing the last cell again, until you see any difference.) What kinds of numbers are inside the "empty" array? Why is that? Check out <a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.empty.html">the documentation</a> for help.

Here are two more functions that create a new array - these can be quite handy for certain types of machine learning:

In [21]:
np.full((5,),fill_value=42)

array([42, 42, 42, 42, 42])

In [20]:
np.random.rand(5)

array([0.15026063, 0.4036821 , 0.71269563, 0.64435591, 0.87424853])

(Run the above cell multiple times, to see which numbers it produces and that they are indeed random floats between 0 and 1.)

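```np.zeros()```, ```np.empty()``` and ```np.full()``` take in a "shape" as argument. ```np.random.rand()``` instead takes the size of each dimension as a separate argument, e.g. ```np.random.rand(6,2)```. Valid shapes are
<ul>
<li>```(x,)```: a vector with length x
<li>```(x,y)```: an x by y matrix
<li>```(x,y,z)```: an x by y by z tensor
<li>etc.
</ul>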

You can find out an array's shape by using its ```.shape``` attribute:

In [26]:
a = np.random.rand(6,2)
a

array([[0.9953653 , 0.63101383],
       [0.85457193, 0.84978236],
       [0.72397707, 0.89471104],
       [0.64495638, 0.5406186 ],
       [0.91880316, 0.44572439],
       [0.96300064, 0.86155061]])

In [27]:
a.shape

(6, 2)
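Here is a quick sketch for checking the shape conventions on arrays of different dimensionality. The variable names are only for illustration; ```.ndim``` is the number of dimensions.

```python
import numpy as np

vector = np.zeros((4,))          # shape (4,): a vector of length 4
matrix = np.zeros((2, 3))        # shape (2, 3): 2 rows, 3 columns
tensor = np.full((2, 3, 4), 7)   # shape (2, 3, 4): two 3-by-4 matrices

# np.random.rand takes the dimensions as separate arguments, not as a tuple
random_matrix = np.random.rand(2, 3)

for arr in (vector, matrix, tensor, random_matrix):
    print(arr.shape, arr.ndim)
```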

Try creating a new array b with the same shape as a, filled with any value you like.

In [52]:
#Code here.
b = np.full((6,2),9)

Now we would like to merge the two arrays, a and b, into a single matrix. Assume the columns of a and b represent the same variable, and each of them contains 6 participants. So we'd like to stick b to the bottom of a. You can do this with ```np.concatenate```. See the <a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.concatenate.html">documentation</a> to find out how to do it. Store the concatenated array in a new variable, c.

Play with the ```axis``` parameter to see what different shapes you get.

In [56]:
#Code here.
c = np.concatenate((a,b))
c

array([[0.9953653 , 0.63101383],
       [0.85457193, 0.84978236],
       [0.72397707, 0.89471104],
       [0.64495638, 0.5406186 ],
       [0.91880316, 0.44572439],
       [0.96300064, 0.86155061],
       [9.        , 9.        ],
       [9.        , 9.        ],
       [9.        , 9.        ],
       [9.        , 9.        ],
       [9.        , 9.        ],
       [9.        , 9.        ]])
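If you are unsure what the ```axis``` parameter does, a small sketch with two tiny made-up arrays (```top``` and ```bottom```, not part of the exercise) may help:

```python
import numpy as np

top = np.zeros((2, 3))
bottom = np.ones((2, 3))

stacked_rows = np.concatenate((top, bottom), axis=0)  # stick bottom under top
stacked_cols = np.concatenate((top, bottom), axis=1)  # stick bottom to the right of top

print(stacked_rows.shape)  # (4, 3)
print(stacked_cols.shape)  # (2, 6)
```

With ```axis=0``` (the default) the number of columns has to match; with ```axis=1``` the number of rows has to match.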

Make another array, d, with shape (3,4). Try concatenating it with array c. Why isn't this working?

In [61]:
# Code here.
d = np.empty((3,4))
d = d.reshape(6,2)
c = np.concatenate((c,d))
c

array([[9.95365304e-001, 6.31013829e-001],
       [8.54571930e-001, 8.49782363e-001],
       [7.23977072e-001, 8.94711037e-001],
       [6.44956377e-001, 5.40618596e-001],
       [9.18803158e-001, 4.45724389e-001],
       [9.63000644e-001, 8.61550611e-001],
       [9.00000000e+000, 9.00000000e+000],
       [9.00000000e+000, 9.00000000e+000],
       [9.00000000e+000, 9.00000000e+000],
       [9.00000000e+000, 9.00000000e+000],
       [9.00000000e+000, 9.00000000e+000],
       [9.00000000e+000, 9.00000000e+000],
       [4.44659081e-323, 4.44659081e-323],
       [4.44659081e-323, 4.44659081e-323],
       [4.44659081e-323, 4.44659081e-323],
       [4.44659081e-323, 4.44659081e-323],
       [4.44659081e-323, 4.44659081e-323],
       [4.44659081e-323, 4.44659081e-323]])

d doesn't match the dimensions of c. Let's say we recorded our data for d in a slightly different way, using wide format. Now we'd like to change that, to make sure we can stitch all the arrays together as desired. The ```.reshape``` method of an array can do this for us. Use ```.reshape``` on d (```d.reshape()```) in the above cell, to make it the same shape as a and b. See the <a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.reshape.html">documentation</a> for details. Afterwards, concatenate c and d and save the result back in the variable c.
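To see what reshaping does to the order of the values, here is a minimal sketch using ```np.arange``` so that every element is recognisable. The names ```wide``` and ```tall``` are illustrative only.

```python
import numpy as np

wide = np.arange(12).reshape(3, 4)   # 3 rows, 4 columns
tall = wide.reshape(6, 2)            # same 12 values, now 6 rows, 2 columns

print(wide)
print(tall)

# The total number of elements must stay the same, otherwise reshape fails.
try:
    wide.reshape(5, 2)
except ValueError as err:
    print("ValueError:", err)
```

The values are read and refilled row by row, so the first row of ```wide``` ends up spread over the first two rows of ```tall```.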

Finally, we want to replace some values in our array c. You can index into the array, and replace values, like this:

In [67]:
c[2,0] = 1 #index into the array: [row,column]
c

array([[1.00000000e+000, 6.31013829e-001],
       [8.54571930e-001, 8.49782363e-001],
       [1.00000000e+000, 8.94711037e-001],
       [6.44956377e-001, 1.00000000e+001],
       [9.18803158e-001, 4.45724389e-001],
       [9.63000644e-001, 8.61550611e-001],
       [9.00000000e+000, 9.00000000e+000],
       [9.00000000e+000, 9.00000000e+000],
       [9.00000000e+000, 9.00000000e+000],
       [6.00000000e+000, 9.00000000e+000],
       [9.00000000e+000, 9.00000000e+000],
       [9.00000000e+000, 9.00000000e+000],
       [4.44659081e-323, 4.44659081e-323],
       [4.44659081e-323, 4.44659081e-323],
       [4.44659081e-323, 4.44659081e-323],
       [4.44659081e-323, 4.44659081e-323],
       [4.44659081e-323, 4.44659081e-323],
       [4.44659081e-323, 4.44659081e-323]])
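Remember that indices start at 0, for rows as well as columns. A small sketch on a made-up array ```grid``` shows this, along with selecting a whole column or row, which goes slightly beyond what you need for this exercise:

```python
import numpy as np

grid = np.arange(6).reshape(3, 2)   # rows 0..2, columns 0..1
grid[1, 0] = 99                     # single element: row 1, column 0

first_column = grid[:, 0]           # all rows of column 0
last_row = grid[-1]                 # negative indices count from the end

print(grid)
print(first_column)   # [ 0 99  4]
print(last_row)       # [4 5]
```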

Set the 10th item in the first (0th) column to the day of the month in which you were born, and the 4th item in the second (1st) column to the number of the month.

In [68]:
# Code here.
c[9,0] = 6
c[3,1] = 10
c

array([[1.00000000e+000, 6.31013829e-001],
       [8.54571930e-001, 8.49782363e-001],
       [1.00000000e+000, 8.94711037e-001],
       [6.44956377e-001, 1.00000000e+001],
       [9.18803158e-001, 4.45724389e-001],
       [9.63000644e-001, 8.61550611e-001],
       [9.00000000e+000, 9.00000000e+000],
       [9.00000000e+000, 9.00000000e+000],
       [9.00000000e+000, 9.00000000e+000],
       [6.00000000e+000, 9.00000000e+000],
       [9.00000000e+000, 9.00000000e+000],
       [9.00000000e+000, 9.00000000e+000],
       [4.44659081e-323, 4.44659081e-323],
       [4.44659081e-323, 4.44659081e-323],
       [4.44659081e-323, 4.44659081e-323],
       [4.44659081e-323, 4.44659081e-323],
       [4.44659081e-323, 4.44659081e-323],
       [4.44659081e-323, 4.44659081e-323]])

Now we'd like to do some math. Let's find out the mean and standard deviation of our two variables, using ```np.mean``` and ```np.std```. Use the argument ```axis=0``` to specify you want one mean/SD per column. ```axis=1``` will give you one mean/SD per row (maybe you want to get each participant's mean?). Leaving the axis argument unspecified will give you a single mean over the entire array.

In [69]:
#Code here.
print(np.mean(c,axis=0))
print(np.std(c,axis=0))

[3.13229623 3.76015457]
[3.86806582 4.30996283]
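On an array as large as c it is hard to check the result by eye. The following sketch uses a tiny made-up array ```scores```, so that you can verify what each ```axis``` setting does:

```python
import numpy as np

scores = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0]])   # 3 participants (rows), 2 variables (columns)

col_means = np.mean(scores, axis=0)   # one mean per column: [3. 4.]
row_means = np.mean(scores, axis=1)   # one mean per row: [1.5 3.5 5.5]
total_mean = np.mean(scores)          # single mean over everything: 3.5
col_sds = np.std(scores, axis=0)      # one SD per column

print(col_means, row_means, total_mean, col_sds)
```

By default, ```np.std``` divides by N rather than N-1 (the ```ddof``` parameter is 0), so it may differ slightly from the sample standard deviation reported by other tools.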


Numpy has many more useful functions and applications, and if you want to get deeper into Machine Learning in Python, I encourage you to dive further into <a href="http://www.numpy.org/">the numpy documentation</a>. For example, Neural Networks - the current state of the art in many areas of Machine Learning today - rely heavily on matrix multiplications. Even if you are a total champ at Linear Algebra and could implement them all by yourself, many PhDs have gone into making numpy super fast and efficient, which is why you should always choose numpy instead. By the way, even Deep Learning frameworks such as <a href="https://pytorch.org/tutorials/index.html">PyTorch</a> use similar data types and functions, so understanding them is a very transferable skill!

For now, we won't need too deep an understanding of numpy arrays and functions. Just know that they exist, and that some of the errors you might get are due to an array being of the wrong shape or datatype. Importantly, numpy functions can be applied to pandas DataFrame/Series objects, which we will be using in this course to store and manipulate data.
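As a small preview of how numpy functions work on pandas objects, here is a sketch with a made-up DataFrame (the column names and values are purely illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"height": [1.0, 4.0, 9.0], "weight": [16.0, 25.0, 36.0]})

roots = np.sqrt(df)                    # element-wise; the result is still a DataFrame
mean_height = np.mean(df["height"])    # a Series reduces to a single number

print(roots)
print(mean_height)
```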

## 3. Pandas

## 4. Zipf's Law

## 5. Variable relationships