# COM4509/6509 - Lab 0: Introduction to Python, Jupyter and Numpy

Each week you will have a lab notebook to work through:
- It is anticipated that the notebooks will take longer than the hour of the lab. Please finish in your own time.
- Feel free to work with your friends on the lab notebooks!

This notebook introduces the tools we will be using.

## Python and Jupyter

You are currently looking at a **Jupyter / Google Colab notebook** that uses Python. If you've not coded in Python before, you should spend a few hours getting used to it. See the [official python tutorial](https://docs.python.org/3/tutorial/index.html). Jupyter notebooks are an interactive way of coding, making notes, plotting graphs, etc. They are a useful tool for exploratory data analysis, but in real projects it is usually necessary to eventually build a Python module with proper class structures, etc.

## Getting it running on your computer

The easiest way to get going with Python, Jupyter and various libraries useful for machine learning is to install the [anaconda distribution](https://www.anaconda.com/download/success). For more info on getting the Jupyter notebook started on your own computer, follow the instructions [here](https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/). You could also use Google Colab, which only requires a web browser, although it can be awkward and it is sometimes easy to lose changes.

## Intro to python, jupyter notebooks, numpy.

We'll be coding in Python, using Jupyter/Colab notebooks, like this one.

The content is in different 'cells'. Some are text (like this one) and some are code, like the next one:

In [1]:
print("I'm a "+"code"+" cell!")

I'm a code cell!


You can edit cells by double clicking on them and run the code cells by clicking the "Run cell" arrow (on the left hand side if you mouse over the cell). Try clicking on the run cell arrow in the cell above.

**Tip: if you have a code cell selected Ctrl+Enter runs it, and Shift+Enter runs it and moves onto the next cell.**

We use libraries when coding, that contain lots of useful tools and functions. A key one for doing machine learning is **numpy**. We can import it like this:

In [2]:
import numpy as np

We can then use functions from it like this:

In [3]:
np.cos(np.pi)

np.float64(-1.0)

## Intro to numpy

We can use it to define vectors, matrices, etc...

- A vector is a list of numbers
- A matrix is a table of numbers

For example, we might have a list of the grades of the class:

In [4]:
grades = np.array([54,34,57,22,84,93,14,45,75,55,84,51,62,44])

This is a one-dimensional vector of length 14. Let's display it (the value of the last expression in a cell is printed as its output):

In [5]:
grades

array([54, 34, 57, 22, 84, 93, 14, 45, 75, 55, 84, 51, 62, 44])

We can ask for the 'shape' of this variable:

In [6]:
grades.shape

(14,)

We can find out what the mean of the grades is:

In [7]:
np.mean(grades)

np.float64(55.285714285714285)

### Slice Notation

We can get just the first item using:

In [8]:
grades[0]

np.int64(54)

Notice that the first item has index ZERO (not one).

We can get the last item using -1:

In [9]:
grades[-1]

np.int64(44)

There's useful syntax called slice notation. For example:
- if we want the first five items, we use `grades[:5]`
- if we want the last three items, we use `grades[-3:]`
- if we want those at indexes 1, 2 and 3: `grades[1:4]` - this is more tricky: the start index (1) is included but the stop index (4) is not. Have a think about what's going on.
- if we want just the ones at indexes 1,3,5,7,9...etc we can use: `grades[[1,3,5,7,9,11,13]]`, or we can tell it how to step through: `grades[1::2]` means it steps through two at a time, starting at index 1.
- if we want just the odd-indexed ones within the first 10, we can use `grades[1:10:2]`.

In [10]:
grades[:5]

array([54, 34, 57, 22, 84])

In [11]:
grades[-3:]

array([51, 62, 44])

In [13]:
grades[[1,3,5,7,9,11,13]]

array([34, 22, 93, 45, 55, 51, 44])
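
The remaining slices from the list above can be tried in the same way. This is a small self-contained sketch that re-creates the `grades` vector; the names `middle`, `odd_indexes` and `odd_first_ten` are just illustrative.

```python
import numpy as np

# Same grades vector as above.
grades = np.array([54, 34, 57, 22, 84, 93, 14, 45, 75, 55, 84, 51, 62, 44])

# start:stop includes index `start` but excludes index `stop`,
# so 1:4 picks out indexes 1, 2 and 3.
middle = grades[1:4]
print(middle.tolist())         # [34, 57, 22]

# start::step runs to the end of the array, taking every `step`-th item.
odd_indexes = grades[1::2]
print(odd_indexes.tolist())    # [34, 22, 93, 45, 55, 51, 44]

# start:stop:step combines both: the odd indexes below 10.
odd_first_ten = grades[1:10:2]
print(odd_first_ten.tolist())  # [34, 22, 93, 45, 55]
```

Note that `grades[1::2]` gives the same items as the explicit index list `grades[[1,3,5,7,9,11,13]]`.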

We have another vector that tells us which students are MSc students.

In [14]:
#grades = np.array([54,34,57,22,84,93,14,45,75,55,84,51,62,44])
msc = np.array([True,False,False,False,True,True,False,False,True,False,True,False,True,False])

We can index the grades array using this array:

In [15]:
grades[msc]

array([54, 84, 93, 75, 84, 62])
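
To see what the boolean index is doing, it can help to convert it into the positions of its `True` entries. The sketch below uses `np.flatnonzero` for that; the variable names `msc_positions` and `by_position` are only for illustration.

```python
import numpy as np

grades = np.array([54, 34, 57, 22, 84, 93, 14, 45, 75, 55, 84, 51, 62, 44])
msc = np.array([True, False, False, False, True, True, False,
                False, True, False, True, False, True, False])

# Positions at which msc is True.
msc_positions = np.flatnonzero(msc)
print(msc_positions.tolist())   # [0, 4, 5, 8, 10, 12]

# Indexing with those positions selects the same items as grades[msc].
by_position = grades[msc_positions]
print(by_position.tolist())     # [54, 84, 93, 75, 84, 62]
print(np.array_equal(by_position, grades[msc]))  # True
```

The boolean vector must be the same length as the array it indexes: one True/False per item.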

And get the average of these:

In [16]:
np.mean(grades[msc])

np.float64(75.33333333333333)

We could put them into a separate variable:

In [17]:
mscgrades = grades[msc]

What's the average of the other students?

In [18]:
np.mean(grades[~msc]) # ~ means 'not'

np.float64(40.25)

What proportion of those on the BSc programme are over 40%?
We can do a boolean comparison to each element in the vector. This will make a new vector with Trues and Falses in:

In [22]:
#grades = np.array([54,34,57,22,84,93,14,45,75,55,84,51,62,44])
#msc = np.array([True, False, False, False, True, True, False, False, True, False, True, False, True, False])
grades[~msc]>40

array([False,  True, False, False,  True,  True,  True,  True])

We are able to ask for the mean of this as it treats 'True' as '1' and 'False' as '0', so the mean will give the proportion that are above 40:

In [23]:
np.mean(grades[~msc]>40)

np.float64(0.625)
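
The same proportion can be built up by hand, which makes the "True counts as 1" behaviour explicit. This is only a sketch of the equivalent steps; `over_40`, `count_over_40` and `n_students` are illustrative names.

```python
import numpy as np

grades = np.array([54, 34, 57, 22, 84, 93, 14, 45, 75, 55, 84, 51, 62, 44])
msc = np.array([True, False, False, False, True, True, False,
                False, True, False, True, False, True, False])

# Boolean vector with one entry per non-MSc student.
over_40 = grades[~msc] > 40

# True counts as 1 and False as 0, so the sum is a count of the Trues.
count_over_40 = np.sum(over_40)
n_students = over_40.size
proportion = count_over_40 / n_students

print(count_over_40, n_students, proportion)  # 5 8 0.625
```

Dividing the count by the number of students gives exactly what `np.mean` returned above.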

In [24]:
attendance = np.array([0.5,0.6,0.4,0.4,0.9,0.9,0.2,0.3,0.8,0.5,0.9,0.5,0.6,0.4]) #the fraction of the lectures each student attended

### Matrices?

(note: in numpy nowadays we just use `array` for everything, including vectors, matrices and higher-order tensors).

We might have a table that has *both* the grade and if they're on the MSc programme, rather than two separate vectors.

We can build a matrix from our vectors, using this syntax:

In [35]:
import pandas as pd

# np.c_ joins the vectors together as the columns of one table.
# All the types must be the same in an array, so the boolean Trues/Falses are converted into 1s and 0s.
students = np.c_[grades,msc,attendance]

# index and column names
row_names = [f"Row_{i}" for i in range(len(students))]
column_names = ["grades", "msc", "attendance"]

# Wrap the array in a pandas DataFrame so that it prints with row and column names
res = pd.DataFrame(students, index=row_names, columns=column_names)
print("Result:\n", res)

Result:
         grades  msc  attendance
Row_0     54.0  1.0         0.5
Row_1     34.0  0.0         0.6
Row_2     57.0  0.0         0.4
Row_3     22.0  0.0         0.4
Row_4     84.0  1.0         0.9
Row_5     93.0  1.0         0.9
Row_6     14.0  0.0         0.2
Row_7     45.0  0.0         0.3
Row_8     75.0  1.0         0.8
Row_9     55.0  0.0         0.5
Row_10    84.0  1.0         0.9
Row_11    51.0  0.0         0.5
Row_12    62.0  1.0         0.6
Row_13    44.0  0.0         0.4
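
The conversion to a single type can be checked directly. The sketch below uses only the first three students to keep it short, and also shows `np.column_stack`, an alternative to `np.c_` that is assumed here purely for comparison.

```python
import numpy as np

grades = np.array([54, 34, 57])          # integers
msc = np.array([True, False, False])     # booleans
attendance = np.array([0.5, 0.6, 0.4])   # floats

# dtype.kind is 'i' for integers, 'b' for booleans, 'f' for floats.
print(grades.dtype.kind, msc.dtype.kind, attendance.dtype.kind)  # i b f

# Joined as columns, everything is promoted to one common type.
students = np.c_[grades, msc, attendance]
print(students.dtype)   # float64
print(students.shape)   # (3, 3)

# np.column_stack builds the same table from a tuple of 1-D arrays.
same = np.column_stack((grades, msc, attendance))
print(np.array_equal(students, same))  # True
```

Floats are the only one of the three types that can hold every value, which is why the grades print as `54.0` and the MSc flags as `1.0` and `0.0`.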


What shape is students?

In [32]:
students.shape

(14, 3)

This is in the order mathematicians are used to (number of rows, then number of columns), so there are 14 rows and 3 columns.

We can ask for the grade of the student at row index 4:

In [36]:
students[4,0] #this asks for row number 4, column number 0

np.float64(84.0)

If we want the attendance of the students in rows 4 to 8 (inclusive):

In [37]:
students[4:9,2] #as it's 4-8 inclusive, we need to ask for 4:9. The '2' tells it that we want column number 2.

array([0.9, 0.9, 0.2, 0.3, 0.8])
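
A bare `:` in either position means "everything along this axis". The sketch below shows this on a small table made from the first four rows of the (uncorrected) students data; the name `table` is just for illustration.

```python
import numpy as np

# First four rows of the students table: grade, msc flag, attendance.
table = np.array([[54.0, 1.0, 0.5],
                  [34.0, 0.0, 0.6],
                  [57.0, 0.0, 0.4],
                  [22.0, 0.0, 0.4]])

row = table[2, :]        # every column of row 2
column = table[:, 0]     # every row of column 0 (all the grades)
block = table[1:3, 0:2]  # rows 1-2 and columns 0-1

print(row.tolist())      # [57.0, 0.0, 0.4]
print(column.tolist())   # [54.0, 34.0, 57.0, 22.0]
print(block.tolist())    # [[34.0, 0.0], [57.0, 0.0]]
```

Selecting a single row or column gives back a one-dimensional vector, while slicing both axes gives a smaller two-dimensional table.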

We've been told that to correct an error, the undergrad students are all going to get an extra 3 marks added. How can we do that?

We can pick out the rows that are undergrads like this.

1) The students that are undergrads are found at:

In [38]:
students[:,1]==0

array([False,  True,  True,  True, False, False,  True,  True, False,
        True, False,  True, False,  True])

2) So we can pick out their grades and add 3 to each:

In [46]:
students[students[:,1]==0,0] = students[students[:,1]==0,0] + 3 #this will change the students table

print(students)
students.shape
## !!! If you run this cell repeatedly you'll add 3 repeatedly!!

[[54.   1.   0.5]
 [37.   0.   0.6]
 [60.   0.   0.4]
 [25.   0.   0.4]
 [84.   1.   0.9]
 [93.   1.   0.9]
 [17.   0.   0.2]
 [48.   0.   0.3]
 [75.   1.   0.8]
 [58.   0.   0.5]
 [84.   1.   0.9]
 [54.   0.   0.5]
 [62.   1.   0.6]
 [47.   0.   0.4]]


(14, 3)

The above illustrates an important lesson for coding notebooks: Try to make a new variable, rather than update old ones. It would have been better to have written something like,

```
corrected_students = students.copy()
corrected_students[students[:,1]==0,0] = students[students[:,1]==0,0] + 3
```
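
To see why this is safer, here is a minimal sketch that runs the correction twice. The helper `add_undergrad_correction` is a hypothetical name introduced for illustration, and only the grade and MSc columns are used to keep the example short.

```python
import numpy as np

grades = np.array([54, 34, 57, 22, 84, 93, 14, 45, 75, 55, 84, 51, 62, 44])
msc = np.array([True, False, False, False, True, True, False,
                False, True, False, True, False, True, False])
students = np.c_[grades, msc]


def add_undergrad_correction(table, marks=3):
    """Return a corrected copy of the table, leaving the original untouched."""
    is_undergrad = table[:, 1] == 0
    corrected = table.copy()
    corrected[is_undergrad, 0] = table[is_undergrad, 0] + marks
    return corrected


# "Re-running the cell" now always starts from the unmodified students table.
corrected_students = add_undergrad_correction(students)
corrected_students = add_undergrad_correction(students)

print(students[:4, 0].tolist())            # [54, 34, 57, 22]
print(corrected_students[:4, 0].tolist())  # [54, 37, 60, 25]
```

However many times the last lines are run, the undergraduates only ever receive 3 extra marks, because `students` itself is never changed.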

### Numpy array 'copy' vs 'view'

This also reveals a key issue in Python (and other modern languages): the variable `students` is effectively a pointer to the memory containing the table of numbers. If we had written `corrected_students = students` without `.copy()`, the `corrected_students` variable would 'point' to the *same table of numbers*. (Slicing a numpy array behaves similarly: it returns a 'view' onto the same memory rather than a copy.) You can see this if you run:
```
a = np.array([1,2,3])
b = a
b[:] = b[:] * 100
print(a)
```
Looking inside `a` you'll find that the values have all been scaled by 100.

Although this is a bit confusing, it's good to just be aware of, as it's a common cause of errors for students new to python! To learn more have a quick read of [Copies and Views](https://numpy.org/doc/stable/user/basics.copies.html).
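
If you are unsure whether two arrays share the same numbers, `np.shares_memory` can tell you. The sketch below compares plain assignment, a slice, an explicit copy, and indexing with a list; the variable names are only illustrative.

```python
import numpy as np

a = np.array([1, 2, 3])

b = a           # a second name for the very same array
c = a[:2]       # a slice: a 'view' onto the same memory
d = a.copy()    # an independent copy
e = a[[0, 1]]   # indexing with a list of indexes also makes a copy

print(b is a)                  # True
print(np.shares_memory(a, c))  # True
print(np.shares_memory(a, d))  # False
print(np.shares_memory(a, e))  # False

# Writing through the view changes `a`, but not the copy.
c[0] = 100
print(a.tolist())  # [100, 2, 3]
print(d.tolist())  # [1, 2, 3]
```

When in doubt, calling `.copy()` before modifying an array is the safe choice.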

## Using numpy to get probabilities!

Time to revisit some of the lecture info...

- what is the probability of a student having a grade over 60? The estimator is simply the proportion of students that do:

In [47]:
np.mean(students[:,0]>60)

np.float64(0.35714285714285715)

- What is the probability of a student attending more than 0.65 of the lectures?

In [48]:
np.mean(students[:,2]>0.65)

np.float64(0.2857142857142857)

What's the probability of a student getting a grade over 60 AND attending more than 0.65 of the lectures?

In [None]:
np.mean((students[:,0]>60) & (students[:,2]>0.65))

0.2857142857142857

Let's stop and look at what we did there...

The first part `(students[:,0]>60)` generates a boolean vector of those with a mark over 60:

In [None]:
students[:,0]>60

array([False, False, False, False,  True,  True, False, False,  True,
       False,  True, False,  True, False])

The second part `students[:,2]>0.65` does the same for those who attended lots:

In [None]:
students[:,2]>0.65

array([False, False, False, False,  True,  True, False, False,  True,
       False,  True, False, False, False])

We can use the `&` operator to make a new boolean vector from these two, with True when both of the others are true.

**Tip: The brackets are necessary here because `&` binds more tightly than `>`; without them we would not get the correct order of precedence.**

In [None]:
(students[:,0]>60) & (students[:,2]>0.65)

array([False, False, False, False,  True,  True, False, False,  True,
       False,  True, False, False, False])

So, does it seem likely that these are independent? If they were, the probability of both happening would equal the product of the two individual probabilities. Let's check:

In [None]:
np.mean(students[:,0]>60) * np.mean(students[:,2]>0.65)

0.10204081632653061

0.10 is much smaller than 0.29, so it seems like they are not independent.

This doesn't mean that the relationship is causal! It might just be that good students do the worksheets, read the text book and go to lectures.

If we knew whether they had done the worksheet, it might be that the grade and lecture attendance are independent GIVEN whether they had done the worksheet.
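
We don't have worksheet data here, but the same boolean indexing is how a conditional probability of this kind would be estimated. As a sketch, the code below conditions on attendance instead: it estimates the probability of a grade over 60 GIVEN high (or low) attendance. It rebuilds the data so that it is self-contained, applying the 3-mark undergraduate correction once; the variable names are illustrative.

```python
import numpy as np

grades = np.array([54, 34, 57, 22, 84, 93, 14, 45, 75, 55, 84, 51, 62, 44])
msc = np.array([True, False, False, False, True, True, False,
                False, True, False, True, False, True, False])
attendance = np.array([0.5, 0.6, 0.4, 0.4, 0.9, 0.9, 0.2,
                       0.3, 0.8, 0.5, 0.9, 0.5, 0.6, 0.4])

# Apply the undergraduate correction once, on a copy.
corrected = grades.copy()
corrected[~msc] += 3

high_grade = corrected > 60
high_attendance = attendance > 0.65

# P(grade > 60), over all students.
p_grade = np.mean(high_grade)

# P(grade > 60 | attendance > 0.65): only look at the high attenders.
p_grade_given_high = np.mean(high_grade[high_attendance])

# P(grade > 60 | attendance <= 0.65): only look at everyone else.
p_grade_given_low = np.mean(high_grade[~high_attendance])

print(round(float(p_grade), 3))             # 0.357
print(round(float(p_grade_given_high), 3))  # 1.0
print(round(float(p_grade_given_low), 3))   # 0.1
```

If grade and attendance were independent, all three numbers would be about the same. With only 14 students these estimates are very rough, so treat them as an illustration of the method rather than a result.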

## Next

Now you've got the hang of python and numpy we can start on the lab notebook itself!