# Basics of Python with Jupyter Notebook

Author: Eduard Incze

Date: 27/10/2021

## 1. Introduction
Python is an easy-to-use but powerful programming language. Over the years, it has gained popularity among data scientists, which has led to many specialised tools being developed in Python. While this tutorial will attempt to cover the basics, many concepts will inevitably have to be left out. A great place to learn more is [the official tutorial](https://docs.python.org/3/tutorial/index.html).

Jupyter Notebooks are hybrid documents that can contain code, outputs, and narrative, organised in cells. They have become almost synonymous with data science, particularly as teaching materials, due to their ability to weave code with immediate outputs and explanations written in markdown. 
Outputs can be printouts or plots, and each cell's outputs display immediately underneath it. Plots can even be made interactive, with the right libraries.

**Note** that by default, outputs are stored in the Notebook so whoever has access to the file can also see these outputs unless they are cleared before sending.

For a more in-depth look at all the things that Jupyter Notebooks can do, there is no better place to start than the [official documentation](https://mybinder.org/v2/gh/ipython/ipython-in-depth/HEAD?urlpath=tree/binder/Index.ipynb).

With all that said, let's begin!

---


## 2. First steps

### 2.1 The `print()` function

One of the most used functions in Python is `print()`. Its simplest use is to display a message.

In [33]:
print('Hello world')

Hello world


In Python, free text (such as 'Hello world' above) must be surrounded by quotation marks ("text") or apostrophes ('text'). Either method is valid, as long as the opening and closing marks match. Text written this way is known as a *string*.

`print()` can have multiple comma-separated *arguments*, which are by default separated by a space.

In [43]:
print("Hello", 'World')

Hello World


Notice in the example above that the two strings had different string-markers surrounding them. This had no effect on the output.

### 2.2 Variables

Variables can be assigned using the `variable = value` syntax.

In [35]:
total = 1 + 1

The value of a variable can be printed using the `print()` function...

In [55]:
print(total)

2


...or by simply calling the variable's name (in a Notebook)

Notice the lack of quotation marks!

In [40]:
total

2

The arguments of `print()` do not have to be of the same type. Here, print is used to display some text and a value:

In [56]:
print('The total is', total)

The total is 2


### 2.3 More on printing with variables

Sometimes, the value of a variable is desired to be in the middle of a sentence, in which case there are several useful methods:

In [89]:
print('There are', total, 'beds left in this ICU')
print(f'There are {total} beds left in this ICU')
print('There are {} beds left in this ICU'.format(total))
print('There are %s beds left in this ICU' % total)

There are 2 beds left in this ICU
There are 2 beds left in this ICU
There are 2 beds left in this ICU
There are 2 beds left in this ICU
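Of the four methods, f-strings also accept a format specification after a colon, which is handy for rounding or padding values. The sketch below is only an illustration; the variables `occupancy` and `beds` are made up for the example.

```python
# Hypothetical values, purely for illustration
occupancy = 2 / 3
beds = 2

rounded = f'Occupancy is {occupancy:.2f}'   # two decimal places
percent = f'Occupancy is {occupancy:.1%}'   # shown as a percentage with one decimal
padded = f'Beds left: {beds:03d}'           # zero-padded to a width of 3

print(rounded)   # Occupancy is 0.67
print(percent)   # Occupancy is 66.7%
print(padded)    # Beds left: 002
```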


### 2.4 Comments
Comments in Python start with a hash sign (`#`).

In [57]:
# Everything after a hash sign on the same line is ignored
# It's generally a good idea to comment your code
# Don't over-comment, though
# It's usually more important to get across what your code is trying to achieve, or why it's written in a certain way, rather than how it works

### 2.5 Mathematical operators
Python can be used for various mathematical operations

In [54]:
print('1 + 1 =', 1 + 1) # Addition
print('1 - 1 =', 1 - 1) # Subtraction
print('2 * 3 =', 2 * 3) # Multiplication
print('2 ** 3 =', 2 ** 3) # Exponentiation
print('2 / 3 =', 2 / 3) # Division
print('7 // 3 =', 7 // 3) # Floor division
print('7 % 3 =', 7 % 3) # Modulus

1 + 1 = 2
1 - 1 = 0
2 * 3 = 6
2 ** 3 = 8
2 / 3 = 0.6666666666666666
7 // 3 = 2
7 % 3 = 1


### 2.6 Comparison operators
These can be used to determine the relationship between two values. They output a Boolean value of `True` or `False`.

In [52]:
print('Is 3 greater than 3?', 3 > 3)
print('Is 6 less than 5?', 6 < 5)
print('Is 1 equal to 1?', 1 == 1)
print('Is 1 not equal to 1?', 1 != 1)
print('Is 3 greater than or equal to 3?', 3 >= 3)
print('Is 3 less than or equal to 4?', 3 <= 4)

Is 3 greater than 3? False
Is 6 less than 5? False
Is 1 equal to 1? True
Is 1 not equal to 1? False
Is 3 greater than or equal to 3? True
Is 3 less than or equal to 4? True


### 2.7 Other operators
These are logical operators (`not`, `and`, `or`), identity operators (`is`, `is not`), membership operators (`in`, `not in`), bitwise operators such as `&` or `|` and augmented assignment operators like `+=` or `-=`.
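As a quick illustration of the operators just listed, the sketch below applies each kind to some made-up values (the variable names are only for this example).

```python
numbers = [1, 2, 3]
same_list = numbers          # a second name for the same list object
look_alike = [1, 2, 3]       # a different list with equal contents

logical = (3 > 2) and not (4 > 6)        # True
membership = 2 in numbers                # True
identity = same_list is numbers          # True: both names refer to one object
equal_not_identical = look_alike == numbers and look_alike is not numbers  # True
bitwise_and = 6 & 3                      # 0b110 & 0b011 = 0b010 = 2
bitwise_or = 6 | 3                       # 0b110 | 0b011 = 0b111 = 7

counter = 10
counter += 5                             # same as counter = counter + 5
counter -= 3                             # same as counter = counter - 3

print(logical, membership, identity, equal_not_identical)   # True True True True
print(bitwise_and, bitwise_or, counter)                     # 2 7 12
```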

Operators can be chained to any degree and can result in complex logical statements. The example below outputs `False` because `'a' in 'car'` evaluates to `True`, so negating it with `not` makes the last condition `False`, and an `and` chain is only `True` if every condition is `True`.

In [83]:
print(3 > 2 and 4 < 6 and not 'a' in 'car')

False


Be very careful with chaining logical statements! Brackets are an easy way to ensure that statements are evaluated in the order you expect.
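To see why brackets matter, here is a small sketch of two common surprises: `and` is evaluated before `or`, and comparison operators (including `in` and `==`) chain together rather than being evaluated one after another.

```python
# 'and' binds more tightly than 'or', so grouping changes the result
without_brackets = True or False and False    # read as True or (False and False)
with_brackets = (True or False) and False

# Comparisons chain, so the first line reads as
# ('a' in 'car') and ('car' == True), which is False
chained = 'a' in 'car' == True
grouped = ('a' in 'car') == True

print(without_brackets, with_brackets)   # True False
print(chained, grouped)                  # False True
```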

### 2.8 Combining variables
Variables can be used together to create new outputs. 

In [59]:
a = 3
b = 1 + 1
c = a / b
print(c) 

1.5


In [60]:
print(f'{a} divided by 1 is {a/1}') # Notice that division always returns a floating point number, regardless of divisibility

3 divided by 1 is 3.0


Some operations only work with certain data types. What happens when you try to add a *string* to an *int*? A *string* to a *string*? Can you multiply *strings* by *ints*?

How you create a new cell largely depends on the method you used to open the Jupyter Notebook. In Spyder and VSCode, there is a '+' button at the top left of the page which inserts a cell below your currently selected cell. By default, newly created cells are code cells; this can be changed to markdown or 'Raw' (used mainly for coding in LaTeX or HTML).
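If you get stuck on the questions above, the sketch below is one way to explore them. It uses a `try`/`except` block (covered later under flow control) so that the cell runs to completion instead of stopping at the error.

```python
joined = 'ab' + 'cd'          # string + string concatenates
repeated = 'ab' * 3           # string * int repeats the string

try:
    result = 'ab' + 1         # string + int is not defined
except TypeError as error:
    result = f'TypeError: {error}'

converted = 'ab' + str(1)     # converting the int first makes the addition valid

print(joined, repeated, converted)   # abcd ababab ab1
print(result)
```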

### 2.9 Variable types

We have already seen *strings* ('hello world'), *ints* (1, 0, -3), *floats* (0.5, -1.2), and *booleans* (True, False).
There are many other types of variables which you'll encounter at various points.

A particularly important type is the *list*. 

**Lists** are ordered collections of other Python objects, which makes them 'compound variables'.

In [110]:
a_list = [1, 6, 9]     # this is a list containing the integer elements 1, 6, and 9
b_list = [1, 6, '9']   # this list looks similar to the first one, but the third element here is defined as a string, not an int

Any item in a list can be accessed through its index (in Python, indexing starts at 0). Negative indices count from the end, so the last item in a list has index -1.

In [113]:
print('The first item in list a is', a_list[0])
print('The second item in list a is', a_list[1])
print('The last item in list b is', b_list[-1])

The first item in list a is 1
The second item in list a is 6
The last item in list b is 9


Notice how the final output looks exactly the same as the others. This can trick you into thinking that '9' is, in fact, an integer. This shows that checking your results using `print()` can be misleading.

The type of any object in Python can be returned through the `type()` function.

In [114]:
type(b_list[-1])

str
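Two related built-in tools are sketched below: `repr()`, which shows the quotation marks that `print()` hides, and `isinstance()`, which checks a value against a type. The list is redefined here so that the block runs on its own.

```python
b_list = [1, 6, '9']

print(repr(b_list[-1]))                    # '9' (the quotes reveal a string)
print(type(b_list[0]), type(b_list[-1]))   # <class 'int'> <class 'str'>

is_text = isinstance(b_list[-1], str)      # True
as_number = int(b_list[-1])                # convert the string '9' to the int 9
print(is_text, as_number + 1)              # True 10
```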

### 2.10 Libraries

Python can do many things out-of-the-box, but very clever and kind people have developed *libraries* (also known as modules or packages), which are simply reusable collections of code that can speed up your workflow. Anything an external library can do could technically be done 'by hand' in standard Python, but why reinvent the wheel?

There are several ways to import libraries, for example the entire library can be loaded, or only a part of it. 

The *statistics* library, as its name suggests, contains [basic statistical functions](https://docs.python.org/3/library/statistics.html). 

In [24]:
import statistics as sts # Loading the entire statistics library, under the alias 'sts'. Any functions from this library need to be prefixed with 'sts'
sts.mean([3, 4, 5.5])

4.166666666666667

In [27]:
from statistics import median # Here, the 'median' function is imported directly, with no alias.
median([3,4,5,6])

4.5

**Note**: As a best practice, **never** import the entire contents of a library into your own namespace with `from library import *`. Doing so might replace any user-defined objects with whatever the library defines under the same name.

In [36]:
pi = 3.14
pi

3.14

In [39]:
from math import * # This line imports every single function, variable, or other object defined in the math module.

In [38]:
pi

3.141592653589793

The *math* module has its own pre-defined value for pi, which overwrites the user-defined value. If whatever process I'm running depends on the assumption that pi=3.14 exactly, this would lead to unexpected results. In a different scenario, if I needed to calculate the circumference of the known Universe to the width of a hydrogen atom, the number of decimal values in *math*'s pi is just not good enough!

On a more serious note, overwriting of existing objects is not the only problem with the 'import *' method. It also reduces clarity of the code. 

If I have 20 libraries that are each imported using 'import *':
- First, I won't know whether any of their objects are being overwritten by any others. In this case, the actual order in which the libraries are imported has an effect on the code! 'pi' could have come from *math*, *numpy*, *scipy*, or *sympy*, to name a few. The value of 'pi' stored in the first three libraries is the same, but the one from *sympy* is a symbolic representation of pi, used in symbolic math, and is incompatible with 'regular' math.
- Second, I won't know which bits came from which library, especially when using a few related libraries! This can drastically reduce the readability of your code, especially if using rather complex functions. Where do you go to find a function's documentation if you don't know which library it's from?
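For contrast, here is a minimal sketch of the same situation using a plain `import math`. Because the library's names stay behind the `math.` prefix, the user-defined `pi` is left alone and it is always clear where each value comes from.

```python
import math

pi = 3.14                      # user-defined value, not affected by the import

print(pi)                      # 3.14
print(math.pi)                 # 3.141592653589793

# The prefix shows at a glance which 'pi' each calculation uses
circumference = 2 * math.pi * 1.0
approx_circumference = 2 * pi * 1.0
print(circumference, approx_circumference)
```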

---

## 3. Jupyter dangers 
### 3.1 Running cells out of order
Variables are stored across cells, and do not depend on cells' relative positions.

In the example below, notice that the variable `danger` looks like it has a value *before* it was assigned.
This is because the cells below were not run in the order that they appear in the document.

If you attempt to run the first cell below before any other, it will throw an error (a `NameError`), because the variable is only defined in the cell after it.

In [5]:
danger

4

In [63]:
danger = 3

In [3]:
danger

3

In [4]:
danger = 4

In [7]:
danger

3

In the original document, the cells above were run out of order, which means that the final value of `danger` was actually 3. For someone else who tries to run this notebook, execution will stop at the first cell with an error; while that can be confusing, it's one of the better outcomes for this sort of mistake. 

**This is one of the most dangerous things about working with Notebooks**. It's also a very common sight, because traditional debugging is not as accessible as in a regular Python file using an IDE. Data scientists end up using cells for testing and debugging and often forget to remove or rearrange those cells.

### 3.2 Variables that self-reference

Let's start with the previously defined `danger` variable. Its value is currently 3:

In [64]:
danger

3

In [70]:
danger = danger + 1

Following the cell above, you'd naturally expect the value of `danger` to be 4 now. However, the output of the cell below shows 5, because the cell changing the value of `danger` was run twice.

In [67]:
danger

5

While this is a very contrived and easy to spot example, it can become a huge liability when working with complex data structures. Generally, try to avoid changing a variable's value outside of the cell where it is defined. A better practice is to pass its value to another variable:

In [68]:
safe = danger

Now, no matter how many times you run the cell modifying `danger`, `safe` stays the same:

In [74]:
print(f'Safe = {safe}, danger = {danger}')

Safe = 5, danger = 7


Unfortunately, there are also exceptions to this. The next section introduces the concept of 'mutability' which illustrates this point.

---
## 4. Intermediate concepts
### 4.1 Mutable and immutable variables
Let's start with an example. In the code cell below, `a` is defined as a *list* containing the numbers 3, 4, and 5. Then the variable `b` is defined as `a`, which, following the previous section, you might expect to make any changes to `a` or `b` independent of each other. Watch what happens when the last element of `a` is removed using the `pop()` method:

In [75]:
a = [3, 4, 5]
b = a
a.pop()
b

[3, 4]

Furthermore, when `pop()` is used on `b`, `a` is also changed:

In [76]:
b.pop()
a

[3]

This happens because *lists* in Python are mutable objects, meaning that their values can change. Numeric types, such as `int`, are immutable. When a variable is defined as `var = 3`, for example, `var` simply 'points' at the number 3, whose value cannot change. You cannot define 3 to be something else. Although `var` itself can be changed to point at a different number, or different object entirely, that has no bearing on the value of 3.

Lists, on the other hand, can be modified, and any variable that points towards a list can be used to modify it. 

This knowledge will come in particularly handy when dealing with *pandas dataframes* on data science projects.
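If an independent list is what you need, one option is to copy it explicitly. The sketch below contrasts a second name for the same list with a copy made using the list's `copy()` method; the variable names are only for illustration.

```python
a = [3, 4, 5]
alias = a                  # another name for the same list
independent = a.copy()     # a new list with the same elements (list(a) or a[:] also work)

a.pop()

print(alias is a, independent is a)   # True False
print(alias)                          # [3, 4]
print(independent)                    # [3, 4, 5]
```

Note that this is a shallow copy: if the list itself contains mutable objects (such as other lists), those inner objects are still shared, and `copy.deepcopy()` from the standard library may be needed instead.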

### 4.2 Flow control and indentation

Python code is generally executed sequentially (line by line) from top to bottom. More often than not, however, we would like some pieces of code to be executed multiple times, while others should be ignored except in special cases. This is where the concept of flow control comes in. Flow here refers to the flow of the execution of the program.

The various ways to control the flow of execution include: `if-elif-else` blocks, `try-except-finally` blocks, `while` loops, and `for` loops.

`while` loops, as the name suggests, loop a block of code until a stopping condition is achieved. If a stopping condition is not well defined, these can very easily turn into infinite loops. For this reason, they are often discouraged.

`for` loops iterate over all the items in a given sequence, in the order that they are in. Because most sequences are finite, it is much harder to create an infinite `for` loop (though still possible!).

`if` blocks are only run once (unless nested in a loop). The `elif` and `else` parts are optional. 

**Note** that conditional code that is meant to run in loops or blocks is indented with a *tab* or, the most common alternative, 4 *spaces*. Indented code makes it clear to Python that only that code should run if the condition is successful. A return to non-indented code signals an exit from that conditional.

It's easiest to explain using an example, so see below:

In [97]:
a = -2
lives = 1

if a > 5:                                                   # First, the 'if' statement checks whether a condition is fulfilled (a>5 here)
    print("That's a big number")                            # If the condition is fulfilled, then this code is executed
elif a == 5:                                                # Otherwise, the program checks the first 'elif' statement (short for else if)
    print("Your number is a little too high")               # If the condition is met, this code is executed
elif a == 0:                                                # You can have as many 'elif' statements as you'd like
    print('Your number is zero')
    print('It needs to be a positive number')
elif a < 0:                                                 # But having too many of them is usually a sign that a different method might be better
    print("Negative numbers are not allowed")
    lives = lives - 1                                        # All indented code under the 'elif' is executed
    if lives == 0:                                          # Including more if statements. Careful! Excessive nesting is also a bad sign
        print("You have lost")                              
else:                                                       # Finally, if none of the other conditions are fulfilled, 'else' is a catch-all
    print("Hurray, your number is between 0 and 5")

Negative numbers are not allowed
You have lost


In [115]:
for i in [0,1,2,3,4,5]:          #using the list [0,1,2,3,4,5] to iterate over. Alternatively, could use range(6).
    print(i)

0
1
2
3
4
5


In [116]:
i = 0
while i < 6:
    print(i)
    i += 1

0
1
2
3
4
5


### 4.3 Functions
We've already used some built-in functions, like `print()` or `type()`, but custom functions can also be created. This is usually done to make code more concise, easier to change, and easier to document.

Let's tackle ease of change. You may need certain blocks of code to be used several times, perhaps with different inputs.

**Example**: Automatic tasks depending on the day of the week

Say you wish to write a message every day, depending on what day of the week it is. This message could be for yourself, as a sort of reminder written to a .txt file, or it could be an automatic email sent to certain people in Outlook, using the [pyOutlook](https://pyoutlook.readthedocs.io/en/latest/index.html) library (Note: This is **not** a suggestion to send automatic emails out to colleagues every morning!).

Perhaps you'd like to download a list of your upcoming meetings for the week on Mondays and compile timesheets based on calendar entries for the past week on Fridays (as shown [in this article](https://ajabbitt.medium.com/automating-outlook-calendar-downloads-with-python-5fb3671b56a)). Maybe even calculate [statistics for time spent in meetings](https://pythoninoffice.com/get-outlook-calendar-meeting-data-using-python/).

In [3]:
# Only run if Monday
message = "Good morning dear colleagues! I got a bad case of the Mondays!"
print(message)
upcoming = '' # Code for getting upcoming meetings

# Only run if Tuesday
message = 'Good morning dear colleagues! Today is a lovely Tuesday!'
print(message)

# Only run if Friday
message = 'Good morning dear colleagues! Cannot wait for the weekend!'
print(message)
timesheets = '' # Code for getting meetings of past week
statistics = '' # Code for calculating statistics


In most Jupyter Notebook configurations, you cannot run part of a cell, so the cell would have to be split or the individual bits of code would have to be joined using if-else statements. This works 'ok', and there are other improvements that can be made, like saving the 'good morning' part of the message separately (which makes it easy to modify for all days at the same time). But what if you want to stop sending the message altogether? That would need to be added as another condition, and this code just gets bigger and bigger.

In this example, a function with a parameter for the day of the week would be more concise and easier to change. As a bonus, you can add another parameter that decides whether the message should be sent or not.

**Note** that just like for flow control statements, indentation shows what code is meant to be part of the function. Any non-indented code following the function definition will be considered outside of the function.

In [12]:
def automated_tasks(dotw='Monday', send_message=True): 
    # The parameters dotw and send_message are given default values. These values can be changed when calling the function manually, as seen at the end
    if dotw == 'Monday':
        message = 'I got a bad case of the Mondays!'
        upcoming = '' # Code for getting upcoming meetings
    elif dotw == 'Tuesday':
        message = 'Today is a lovely Tuesday!'
    elif dotw == 'Friday':
        message = 'Cannot wait for the weekend!'
        timesheets = '' # Code for getting meetings of past week
        statistics = '' # Code for calculating statistics
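    else:
        message = 'Have a nice day!' # Placeholder fallback so that 'message' is defined for any other day
    if send_message: # A Boolean can be tested directly, without '== True'
        print("Good morning dear colleagues!", message)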
    
automated_tasks(dotw='Friday', send_message=True)

Good morning dear colleagues! Cannot wait for the weekend!


This is similar to 'hard-coding' a formula cell in Excel versus using cell references, which make it a lot easier to modify the behaviour of your calculations.

Even better, we can automate even the input of the day of the week, by importing the `datetime` library and calling `datetime.datetime.today().strftime('%A')`

In [46]:
import datetime
today = datetime.datetime.today().strftime('%A')
print(today)

Thursday
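Putting the two pieces together might look like the sketch below. To keep the block self-contained it uses a simplified, hypothetical helper called `daily_message` rather than the function above, and it looks the message up in a dictionary (a type not covered in this tutorial) with a fallback for days that have no specific message.

```python
import datetime

def daily_message(dotw):
    # Hypothetical helper: maps a day name to a message, with a fallback
    messages = {
        'Monday': 'I got a bad case of the Mondays!',
        'Tuesday': 'Today is a lovely Tuesday!',
        'Friday': 'Cannot wait for the weekend!',
    }
    return messages.get(dotw, 'Have a good day!')

# '%A' gives the full weekday name (in English unless the locale has been changed)
today = datetime.datetime.today().strftime('%A')
print('Good morning dear colleagues!', daily_message(today))
```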


Another use-case for functions is when you need a certain output which can be obtained after a number of steps, but the intermediary values are not useful. That's because functions are self-contained. Any variables defined within the scope of a function are not accessible outside of the function, unless specifically *returned* by the function. When *return* is used, calling the function returns a value. This value can be assigned to a variable or printed. **Note** that function execution stops immediately after reaching the *return* statement. Only use it when you wish the function to exit.
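Both points can be seen in a small sketch. The function name `first_even` and its inputs are made up for illustration: the variable `checked` only exists inside the function, and the first `return` that is reached ends the function, so the loop never gets to the last number.

```python
def first_even(numbers):
    checked = 0                   # local variable: exists only inside the function
    for number in numbers:
        checked += 1
        if number % 2 == 0:
            return number         # execution stops here; the loop does not continue
    return None                   # reached only if no even number was found

found = first_even([3, 7, 8, 10])
print(found)                      # 8

try:
    print(checked)                # 'checked' was never defined outside the function
except NameError as error:
    print('NameError:', error)
```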

See below for a very inefficient manual calculation of the greatest common divisor of two numbers:

In [55]:
def complicated_calc(numb1, numb2):
    numb1_divisors = [i for i in range(1,numb1+1) if numb1 % i == 0]
    numb2_divisors = [i for i in range(1,numb2+1) if numb2 % i == 0]
    common_divisors = [i for i in numb1_divisors if i in numb2_divisors]
    gcd = max(common_divisors)
    return gcd

complicated_calc(15,20)

5

The only value we're interested in here is the gcd itself; the lists of divisors and the list of common divisors are superfluous and do not need to be returned.
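For comparison, the standard library already provides this calculation as `math.gcd()`, which ties back to the earlier point about not reinventing the wheel. A quick check against the same inputs:

```python
import math

result = math.gcd(15, 20)
print(result)              # 5
print(math.gcd(12, 18))    # 6
```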

## 5. Conclusion
Congratulations on reaching the end! 

While this tutorial has barely scratched the surface of what Python can do, hopefully it has given you enough of a basis to start looking out for other, more specific resources.
Here are some links to get started with:

Constantly updating lists of open source libraries: [one](https://github.com/ml-tooling/best-of-ml-python) and [two](https://github.com/josephmisiti/awesome-machine-learning#python)

This [subreddit](https://www.reddit.com/r/learnpython) and [post](https://www.reddit.com/r/learnpython/comments/gk517f/data_analysis_resources_for_python/) specifically.

[Tons](https://jakevdp.github.io/PythonDataScienceHandbook/) of [free](http://neuralnetworksanddeeplearning.com/) [books](https://greenteapress.com/wp/think-bayes/) (search for 'free python data science books' for more)


The next Notebook in this series is aimed at dataframes. See you there!