# Workshop 3
Many thanks to Theano Xirouchaki for their workshop template! - https://github.com/theo-xir
## Numpy

Numpy is a very useful library that gives an alternative way to work with arrays. It also provides statistical functions and vector and matrix operations. To begin with, let's import it. Note: the "as np" part means we can refer to it as "np" instead of "numpy" for the rest of the file. We are, essentially, creating an alias, or nickname. This library is vital because of how much it speeds up computation by <b>vectorising</b> operations: an operation is applied to a whole array at once instead of looping over the elements one by one in Python.

In [None]:
import numpy as np
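
To make "vectorising" concrete, here is a small sketch that computes the same squares with a loop and with a single array expression. The variable names are ours, chosen for illustration. On an array this small the speed difference is negligible; it only becomes noticeable on large arrays.

```python
import numpy as np

values = list(range(1, 6))          # plain Python list: [1, 2, 3, 4, 5]

# Loop version: Python handles one element at a time.
squares_loop = []
for v in values:
    squares_loop.append(v ** 2)

# Vectorised version: one expression acts on the whole array.
squares_vec = np.array(values) ** 2

print(squares_loop)
print(squares_vec)
```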

We can create a numpy array as follows:

In [None]:
a  = np.array([1,9,8,3])
print(a)

In [None]:
l = [1,9,8,3]
print(l)
ltoa=np.array(l)
print(ltoa)

Sometimes we might want to apply an operation to an entire list at once. If we try this with our generic Python list, we will see it does not work: Python raises a TypeError.

In [None]:
print(l+10)

However, it does work with numpy arrays.

In [None]:
print(a+10)

In [None]:
print(a*2)
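
For comparison, here is a short sketch of how the same operations look on a list and on an array. It reuses the values from above; the names `l_plus_10` and `a_plus_10` are just for illustration.

```python
import numpy as np

l = [1, 9, 8, 3]
a = np.array(l)

# With a list we have to visit each element ourselves.
l_plus_10 = [x + 10 for x in l]

# With an array the scalar is applied to every element.
a_plus_10 = a + 10

print(l_plus_10)   # [11, 19, 18, 13]
print(a_plus_10)   # [11 19 18 13]

# Careful: * means "repeat" for lists but "multiply" for arrays.
print(l * 2)       # [1, 9, 8, 3, 1, 9, 8, 3]
print(a * 2)       # [ 2 18 16  6]
```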

We can check the shape of an array and the type of its elements.

In [None]:
print(a.shape)
print(a.dtype)

We can also create 2D arrays (and higher, but we're not going to worry about that).

In [None]:
b = np.array([(1,2,3),
              (4,5,6)])
print(b.shape)

We can easily create an array of certain dimensions and initialise all its values to zero as follows.

In [None]:
f = np.zeros((2,3))
print(f.dtype)
print(f)

We can also choose what type we want the elements to be.

In [None]:
print(np.zeros((2,3), dtype=np.int32))

We can similarly create an array with all its values initialised to one.

In [None]:
print(np.ones((2,3), dtype=np.int16))

We can index numpy arrays with the same logic as generic Python lists.

In [None]:
print(b)

In [None]:
print('First row:', b[0])

In [None]:
print('Second element of first row: ', b[0][1])
print('Second element of first row: ', b[0,1])

In [None]:
print('Second column:', b[:,1])

Note, a big advantage of NumPy is the ability to select columns far more easily than with a 'list of lists' method.

In [None]:
b_bad = [[1,2,3],[4,5,6]]
print(b_bad)

col2 = []
for row in b_bad:
    col2.append(row[1])
print(col2)
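
With the array version, the loop above collapses into one indexing expression. A few more slicing patterns, as a sketch on the same 2x3 array:

```python
import numpy as np

b = np.array([(1, 2, 3),
              (4, 5, 6)])

print(b[:, 1])        # second column: 2 and 5
print(b[1, :])        # second row: 4, 5, 6
print(b[:, 1:])       # last two columns of every row
print(b[:, [0, 2]])   # first and third columns only
```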

We can reshape an array as follows.


In [None]:
d = np.array([(1,2,3,4,5,6,7,8)])
e = d.reshape(4,2)
e

In [None]:
g = np.array([(1,2,3,4,5,6)])
g = g.reshape(-1,2)
g
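
The -1 used above tells NumPy to work out that dimension from the number of elements. A small sketch (the array `h` is ours, for illustration):

```python
import numpy as np

h = np.arange(1, 13)             # 12 elements: 1, 2, ..., 12

print(h.reshape(3, 4).shape)     # (3, 4)
print(h.reshape(-1, 4).shape)    # (3, 4): NumPy infers 12 / 4 = 3 rows
print(h.reshape(2, -1).shape)    # (2, 6): NumPy infers 12 / 2 = 6 columns

# The requested shape must fit the number of elements exactly.
try:
    h.reshape(5, -1)
except ValueError as err:
    print("Cannot reshape:", err)
```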

Recall from the last workshop that we can calculate max, min and mean in a similar way. There are two options: function or method. Functions take an input and spit out an output, e.g. len(list). Methods are a subset of functions that 'belong' to a type of object, e.g. list.append(). Both work here; it's just preference. Methods can have extra uses, as we'll see in a second.

In [None]:
print(np.min(e))
print(e.min())
print(e.max())
print(e.mean())

To calculate the sum/min/max/mean for each row or column, we pass the extra argument `axis` to our methods: `axis=0` gives one result per column and `axis=1` gives one result per row.

In [None]:
print(e.sum(axis=0))
print(e.sum(axis=1))
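
The same `axis` argument works for min, max and mean. A short sketch on the 4x2 array from the reshape example:

```python
import numpy as np

e = np.array([1, 2, 3, 4, 5, 6, 7, 8]).reshape(4, 2)

print(e.min(axis=0))    # smallest value in each column: 1 and 2
print(e.max(axis=1))    # largest value in each row: 2, 4, 6, 8
print(e.mean(axis=0))   # mean of each column: 4.0 and 5.0
```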

We can also append elements to a numpy array.

In [None]:
a = np.array([1, 2, 3])
newArray = np.append(a, [10, 11, 12])
print(newArray)
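
Two things are worth knowing about np.append: it returns a new array and leaves the original untouched, and without an `axis` it flattens 2D input. A small sketch (the array names are ours):

```python
import numpy as np

a = np.array([1, 2, 3])
new_array = np.append(a, [10, 11, 12])

print(a)           # original is unchanged: 1, 2, 3
print(new_array)   # 1, 2, 3, 10, 11, 12

# Without an axis, np.append flattens everything into 1D.
b = np.array([[1, 2, 3], [4, 5, 6]])
flat = np.append(b, [[7, 8, 9]])
stacked = np.append(b, [[7, 8, 9]], axis=0)   # adds a new row instead

print(flat.shape)      # (9,)
print(stacked.shape)   # (3, 3)
```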

### Exercise 1:

Write a function that, given a number x that is a multiple of 2, returns a NumPy array with elements 1,...,x, arranged in rows of 2.

### Exercise 2:

Write a function that creates an x by x identity matrix.

### Exercise 3:

Create a 4x4 NumPy array called 'big_box' with 0s around the outside and a 2x2 square of 1s in the middle. Then, use indexing to select the 2x2 square and assign it to 'little_box'.

We can find the dot product of two vectors.

In [None]:
c=np.array([(1,2,3)])
d=np.array([(4,5,6)])
d=d.reshape(3,1)
np.dot(c,d)

We can also do matrix multiplication.

In [None]:
e=np.array([(1,3),(1,8)])
f=np.array([(4,5),(2,5)])
np.matmul(e,f)
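
A common slip is to use `*` for matrix multiplication. On NumPy arrays `*` multiplies element by element, while the `@` operator gives the matrix product. A quick sketch with the same two matrices:

```python
import numpy as np

e = np.array([(1, 3), (1, 8)])
f = np.array([(4, 5), (2, 5)])

print(e @ f)   # matrix product, same result as np.matmul(e, f)
print(e * f)   # elementwise product, NOT matrix multiplication
```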

Let's talk a bit about distributions. Numpy allows us to sample probability distributions and has some convenient functions for learning things about our data.

In [None]:
normal_array = np.random.normal(5, 0.5, 5)
print(normal_array)

This takes five samples from a normal distribution with mean 5 and standard deviation 0.5. We can then check various properties of the datapoints we got.

In [None]:
print(np.mean(normal_array))
print(np.std(normal_array))

We can see that the mean and standard deviation of our data are close to those of the distribution they were drawn from, but not identical. Let's see what happens if we increase the sample size.

In [None]:
normal_array = np.random.normal(5, 0.5, 50)
print(np.mean(normal_array))
print(np.std(normal_array))

In [None]:
normal_array = np.random.normal(5, 0.5, 500)
print(np.mean(normal_array))
print(np.std(normal_array))

In [None]:
normal_array = np.random.normal(5, 0.5, 50000)
print(np.mean(normal_array))
print(np.std(normal_array))
normal_array
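
Because the samples are random, the numbers above change on every run. If you want repeatable results, one option is NumPy's seeded generator, np.random.default_rng. The sketch below repeats the experiment in a loop; the seed value and the `errors` dictionary are our own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # the seed value is an arbitrary choice

errors = {}
for n in (5, 50, 500, 50000):
    sample = rng.normal(5, 0.5, n)         # mean 5, standard deviation 0.5
    errors[n] = abs(sample.mean() - 5)     # distance from the true mean
    print(n, sample.mean(), sample.std())
```

Larger samples tend to land closer to the true mean, although any single small sample can get lucky.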

## Matplotlib
We are now going to have a look at another very useful library, matplotlib. It is inspired by MATLAB and helps us plot data. Let's have a look at plotting a simple histogram.

In [None]:
import matplotlib.pyplot as plt

plt.hist(normal_array, bins=100)
plt.show()

### Exercise 4:
Use np.random.poisson to sample a Poisson distribution, then plot the data as above. Test different numbers of bins and trials.

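We can use np.linspace to generate an array of evenly spaced values from a start value to an end value. The third argument is the number of points to generate, not the step size, and both endpoints are included.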

In [None]:
print(np.linspace(1,10,10))
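
It is easy to mix up np.linspace and np.arange: linspace takes the number of points, while arange takes the step size and leaves out the end value. A small side-by-side sketch:

```python
import numpy as np

by_count = np.linspace(0, 10, 5)   # 5 points, end value included
by_step = np.arange(0, 10, 2.5)    # step of 2.5, end value excluded

print(by_count)   # 0, 2.5, 5, 7.5, 10
print(by_step)    # 0, 2.5, 5, 7.5
```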

In [None]:
xs=np.linspace(0, 10, 10)
ys=(xs*2)+5
plt.plot(xs,ys)

In [None]:
xs=np.linspace(0, 10, 10)
ys=xs**2 + xs*3 +4
plt.plot(xs,ys)
plt.show()

In [None]:
# 20 random integers between 0 and 19 (the upper bound is excluded)
xs=np.random.randint(0,20,20)
ys=xs**2+xs*3 +4
plt.scatter(xs,ys)
plt.show()

We can also change the symbols we use on our plots.

In [None]:
t = np.arange(0., 5., 0.2)

# red dashes, blue squares and green triangles
plt.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
plt.show()
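
Plots are much easier to read with axis labels, a title and a legend. Here is one way to add them to the plot above, as a sketch; the label texts and the name `line_handles` are our own.

```python
import numpy as np
import matplotlib.pyplot as plt

t = np.arange(0., 5., 0.2)

# plt.plot returns one line object per curve; we keep them for the legend.
line_handles = plt.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')

plt.xlabel('t')
plt.ylabel('value')
plt.title('Powers of t')
plt.legend(line_handles, ['t', 't squared', 't cubed'])
plt.show()
```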

There is a lot more we can do with this library! We encourage you to try things for yourself; there is a lot of documentation and many examples online.

### Exercise 5:
Just play around with graphs!

Now that we've done the groundwork, let's look at our first simple algorithm. The algorithm we are going to implement is called k-means. First, let's make some data.

In [None]:
xsA=np.random.normal(2, 0.5, 20)
ysA=np.random.normal(3, 1, 20)

xsB=np.random.normal(4, 0.8, 20)
ysB=np.random.normal(3, 0.5, 20)


plt.scatter(xsA, ysA)
plt.scatter(xsB, ysB)

plt.show()

We know these are two separate sets of data, as well as their mean and variance, because we generated them. What if we didn't know these things and needed to separate the data? 

In [None]:
xs=np.concatenate((xsA,xsB))
ys=np.concatenate((ysA,ysB))

plt.scatter(xs,ys)
plt.show()

One way to go about it is to put them in two "clusters". For our algorithm, we first need to randomly pick two centers for our clusters.

In [None]:
x1=np.random.randint(0,6)
y1=np.random.randint(1,8)

x2=np.random.randint(0,6)
y2=np.random.randint(1,8)

Then, we need to check for each point which cluster it is closer to.

In [None]:
import math

def distance(x1,y1,x2,y2):
    return math.sqrt((x1-x2)**2+(y1-y2)**2)

def findclusters(xs,ys,x1,y1,x2,y2):
    Axs=np.array([])
    Ays=np.array([])
    Bxs=np.array([])
    Bys=np.array([])

    for i in range(len(xs)):
        if distance(xs[i],ys[i],x1,y1)<distance(xs[i],ys[i],x2,y2):
            Axs=np.append(Axs,xs[i])
            Ays=np.append(Ays,ys[i])
        else:
            Bxs=np.append(Bxs,xs[i])
            Bys=np.append(Bys,ys[i])

    plt.scatter(Axs,Ays)
    plt.scatter(Bxs,Bys)
    plt.show()
    return(Axs,Ays,Bxs,Bys)
    
findclusters(xs,ys,x1,y1,x2,y2)
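
The loop above checks one point at a time. As an alternative sketch, the same assignment step can be vectorised with a boolean mask. The function name `assign_clusters` and the tiny test data are ours, for illustration only.

```python
import numpy as np

def assign_clusters(xs, ys, x1, y1, x2, y2):
    """Return a boolean mask: True where a point is closer to centre 1."""
    dist1 = np.sqrt((xs - x1) ** 2 + (ys - y1) ** 2)
    dist2 = np.sqrt((xs - x2) ** 2 + (ys - y2) ** 2)
    return dist1 < dist2

# Four made-up points: two near (0, 0) and two near (10, 10).
xs = np.array([0.0, 1.0, 9.0, 10.0])
ys = np.array([0.0, 1.0, 9.0, 10.0])
in_A = assign_clusters(xs, ys, 0, 0, 10, 10)

print(xs[in_A], ys[in_A])     # points assigned to cluster A
print(xs[~in_A], ys[~in_A])   # points assigned to cluster B
```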

Once we've done this for all points we need to recalculate the cluster centers. We do that by taking the average x and y coordinates of each of the two lists.

In [None]:
def average(xs,ys):
    x=sum(xs)/len(xs)
    y=sum(ys)/len(ys)
    return x,y
print(average(xs,ys))

We notice that as we run this, it gives better and better results. Congrats! You now know how to implement a fundamental data processing algorithm!

In [None]:
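def kmeans(i,xs,ys,x1,y1,x2,y2):
    # Repeat the two steps i times: assign points, then move the centres.
    for j in range(i):
        Axs,Ays,Bxs,Bys=findclusters(xs,ys,x1,y1,x2,y2)
        x1,y1=average(Axs,Ays)
        x2,y2=average(Bxs,Bys)
    # Return the final centres so they can be inspected or reused.
    return x1,y1,x2,y2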

In [None]:
i=100
kmeans(i,xs,ys,x1,y1,x2,y2)
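
Running a fixed 100 iterations is usually more than needed, because the centres stop moving after a few rounds. Below is a compact sketch of a variant that stops early and also guards against an empty cluster, whose mean would be undefined. It is not the version built above: the name `kmeans_two`, the `max_iter` parameter, the seed and the starting centres are all our own assumptions.

```python
import numpy as np

def kmeans_two(xs, ys, x1, y1, x2, y2, max_iter=100):
    """Two-cluster k-means sketch that stops once the centres stop moving."""
    for _ in range(max_iter):
        # Squared distances are enough for deciding which centre is closer.
        dist1 = (xs - x1) ** 2 + (ys - y1) ** 2
        dist2 = (xs - x2) ** 2 + (ys - y2) ** 2
        in_A = dist1 < dist2
        if in_A.all() or not in_A.any():
            break  # one cluster is empty, so its mean is undefined
        new = (xs[in_A].mean(), ys[in_A].mean(),
               xs[~in_A].mean(), ys[~in_A].mean())
        if np.allclose(new, (x1, y1, x2, y2)):
            break  # centres no longer move: converged
        x1, y1, x2, y2 = new
    return x1, y1, x2, y2

rng = np.random.default_rng(seed=1)   # arbitrary seed for repeatability
xs = np.concatenate((rng.normal(2, 0.5, 20), rng.normal(4, 0.8, 20)))
ys = np.concatenate((rng.normal(3, 1, 20), rng.normal(3, 0.5, 20)))
print(kmeans_two(xs, ys, 0, 3, 6, 3))
```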