# Day 1

## Introduction

Python is a Turing-complete scripting language, which means it is a very useful tool for solving problems with logical algorithms. Put simply, it tells your computer how to crunch data or do computations for you.

If you are reading this, you are most likely a biology student and wish to analyze some data.
This is what I intend to teach you here. In my personal experience, the adoption of programming languages is not hindered by a lack of “how to code” literature. The internet is full of sources that teach you how to code; what they do not teach you is how to think like the people who created the languages or tools you are using. This matters for **you** especially,
because your training as a biologist differs in some fundamental aspects from that of mathematicians, computer scientists and physicists. These differences in training enable you to understand and investigate biological processes in a way that I never could, but they also hinder you in developing and using *code*. For this reason I will focus more on the way of thinking than on syntax and libraries. Once you have understood the fundamental concepts you should be able to close the gaps using the [official documentation](https://docs.python.org/3/) and [tutorial](https://docs.python.org/3/tutorial/index.html).

Let me begin by quoting the first paragraph from the [official tutorial](https://docs.python.org/3/tutorial/index.html):

> Python is an easy to learn, powerful programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming. Python’s elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application development in many areas on most platforms.

I assume this paragraph did not teach you much, which can be attributed to two gaps in your knowledge. The first gap is your vocabulary, or your knowledge of definitions. This is knowledge that you can easily acquire by consulting textbooks or Wikipedia. I attempted to mark the terms that are most strongly affected by this gap here:

> Python is an easy to learn, powerful *programming language*. It has *efficient* *high-level* *data structures* and a simple but effective approach to *object-oriented* programming. Python’s elegant *syntax* and *dynamic typing*, together with its *interpreted* nature, make it an ideal language for *scripting* and rapid *application* development in many areas on most *platforms*.

So understanding these terms is a hurdle, especially because most of them are suspiciously familiar, like *efficient*. I doubt, however, that you measure efficiency in *operations* and *memory* consumption, which is how the term should be understood in this paragraph. As mentioned before, though, this is not the major challenge. Let me highlight two more words for you:

> Python is an easy to learn, powerful programming language. It has efficient high-level data structures and a **simple** but effective approach to object-oriented programming. Python’s **elegant** syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application development in many areas on most platforms.

The terms here show the second gap in your knowledge: the way you approach, understand and interact with **problems**. For you a problem is probably a hurdle, an inconvenience, an obstacle; for me it is the foundation of scientific study. For me, science consists of **problems** and **solutions**, which fit together like puzzle pieces, enabling us to understand and control the world. Only if you have both a **problem** and a **solution** can you publish; if you publish an observation, you publish something you believe will help someone else to **solve** a question. So this person has a question or **problem** and you provide a **solution**. This question-answer relationship is anchored rather deeply in computer science, where you as the user want to achieve something, thereby defining a **problem**, and then create your **solution** based on the work of other software developers.

I chose the previous example to show you this rather alien way of thinking. It may seem familiar to you now, but unless you previously studied a mathematics-related subject I assure you it is not. To stress this point let us discuss **simple** and **elegant**. What do they mean here? I would claim they refer to a very mathematical concept of perfection, which we may embody in the sentence: “Say precisely enough”. This means everything that needs to be said is said and anything else is not. This is often achieved by very few strict and solid definitions. Python, for example, was initially lauded for its rather small number of [reserved terms](https://docs.python.org/3.13/reference/lexical_analysis.html#keywords) and [built-in functions](https://docs.python.org/3/library/functions.html). It was considered **simple** and **elegant** because it used **few** terms and combined them into a bigger system. It is **simple** because it has **few** constituent parts, and **elegant** because they achieve the larger goal by their interaction. It is in the interaction of the objects, concepts and definitions that beauty can be found, and this is what is referred to here. Teaching a course on mathematical and logical aesthetics is beyond me, so I will summarize with: it takes roughly 3 years to learn and you are fine hobbling along and ignoring it. What you cannot ignore, however, is the way of thinking around it.

Let me [quote](https://en.wikiquote.org/wiki/Donald_Knuth) Donald Knuth, a rather well-known computer scientist, on what makes a good programmer:

> The psychological profiling [of a programmer] is mostly the ability to shift levels of abstraction, from low level to high level. To see something in the small and to see something in the large. 

What he refers to can be described as the ability to change perspective: to switch from a cellular view to the organ, to the organism and to the entire ecosystem depending on the question asked and, more importantly, to stay quiet about the aspects that do not matter for the question. In other words, to **say precisely enough**.

This way of thinking makes great engineers and physicists, as they are able to find the common characteristic of a brown bear, a moose and a sack of grain on the road. They are all obstacles that slow down traffic. They all have a position on the road, which can be expressed by the distance along it. This information is sufficient to steer traffic away from the obstacle and is therefore precisely enough for a road manager.

The central mechanism they use to describe things is the *equivalence* class. *Equivalence* means we can exchange entities in a context without anything changing. If someone is thirsty it does not matter if we hand them water or iced tea; therefore water and iced tea are *equivalent* in the context of quenching thirst. In the same way, a 100 € bill and a 10 € bill are equivalent for buying one bag of flour. It does not matter which one you have, either buys a bag of flour, at least until inflation ruins this example. An *equivalence* class describes all things that are *equivalent* within a context. Since a brown bear, a moose and a sack of grain all block the road, they are part of the same *equivalence* class.

So once it is clear what things share, they also *decide* to perceive only the characteristics they deem relevant. If something blocks the road they only need to know where to drive. Since the road is a long line they only need to know how far from the beginning of the road the obstacle is placed. The pitfall is already built into the example: removing a sack of grain and removing a living animal are two quite distinct tasks requiring distinct equipment. Note that they would not attach the nature of the obstacle to their description, but would prefer to say something like “bear spray required for removal”. While a bear always requires bear spray, there might be other obstacles that can be removed with bear spray, so “bear spray required for removal” is the larger *equivalence* class.
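To make the idea of an *equivalence* class a little more tangible, here is a minimal Python sketch. Everything in it is invented for illustration: the distances, the equipment and the variable names are my assumptions, not real road data.

```python
from collections import defaultdict

# Made-up obstacles: (name, distance from the start of the road in km,
# equipment needed for removal). All values are invented for illustration.
obstacles = [
    ("brown bear", 12.5, "bear spray"),
    ("moose", 3.0, "bear spray"),
    ("sack of grain", 27.8, "forklift"),
]

# For steering traffic only the position matters, so all three obstacles
# are equivalent and we keep nothing but the distances.
blocked_positions = sorted(distance for _, distance, _ in obstacles)

# For removal the equipment matters, so we group the obstacles into
# one equivalence class per piece of equipment.
by_equipment = defaultdict(list)
for name, _, equipment in obstacles:
    by_equipment[equipment].append(name)

print(blocked_positions)
print(dict(by_equipment))
```

The same three obstacles end up in one class or in two, depending on the question we ask. That choice of context is the abstraction.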

Now you may wonder why you were trained to think differently. The answer is quite simple: this way of thinking makes quite poor observers. If you hand such people a piece of paper and tell them to observe a rat for 30 minutes, there is a good chance they will write five sentences: “Rat walked to the far end of the cage. Rat walked to the center of the cage. Rat ran around the cage. Rat walked back to far end of the cage. Rat walked around the cage.” You know that this is not “precisely enough”, but they will only know after you begin asking questions.

The correct use of abstraction is, next to a few fundamental mathematical concepts used to abstract problems, the difference between a capable and an incapable programmer. It is the difference between endless despair and flying levity, and it is what I hope to demonstrate and teach you in this course.

## Course plan

After the philosophical part, let us look at an example so we know what we are talking about. Let us assume you have a friend Alice and a friend Bob. Alice is a computer-vision specialist, which means she can find things in images. Bob researches cancer. Bob is interested in the growth of cancer cells, so he takes five Petri dishes, adds his cell culture, and photographs them with a microscope every day for twelve days. Assume it looks like this image from [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:384_microwell_plate_imaged_with_2.5_x_magnification_in_3_channels_with_ZEISS_Celldiscoverer_7_%2830614936632%29.jpg):

![ExampleImage](https://upload.wikimedia.org/wikipedia/commons/6/64/384_microwell_plate_imaged_with_2.5_x_magnification_in_3_channels_with_ZEISS_Celldiscoverer_7_%2830614936632%29.jpg)

Since counting all these cells would take a long time and also risks introducing human mistakes, Bob asks Alice for help. Alice runs a computer-vision program over the images and creates a few comma-separated-values files (.csv) as output; she then goes on vacation. Since he has no idea how to evaluate the files Alice sent him, Bob asks you to help him.

So the goal of the course is to take a few comma-separated-values files and create a few plots to learn something about the data they contain. To begin this process we will first talk about data, **variables**, **values** and **operators**. These are the fundamental building blocks of our code. Afterwards we will talk about **control-structures**, like **loops** and **conditional-instructions**. These allow our code to react to its data and to process an arbitrary amount of it. We will then talk about **functions**, **classes** and **modules**. These help us organize our code and ensure we only deal with the things we really need to think about. In the end we will use external modules to visualize and analyze the data. This is the final step that helps us answer Bob's questions.

So the course is separated into five parts with separate content and goals:

| Part | Content                     | Goal                                              |
| ---- | --------------------------- | ------------------------------------------------- |
|    1 | Theory and perspective      | You have an idea how programmers think            |
|    2 | Fundamental building blocks | You can add numbers                               |
|    3 | Control structures          | You can write short simple programs               |
|    4 | Functions and classes       | You can write more complex programs               |
|    5 | External modules            | You visualized data and can now learn on your own |

We will follow this example through the course and use it to illustrate the different steps.

## Approaching a problem

So let us begin with the first question: How do we start?
What questions are relevant?
What do we need to proceed?

Please partner up into groups of two or three people and write down in the next cell what you need to know to begin your plan. Please remember “Say precisely enough”.

Behind this spoiler is my suggested answer. It is not *the* right answer, but I claim that it is *a* right answer, since there is more than one. Please do not open the spoiler until you are confident that your answer in the cell above is correct.

<details>
  <summary>Click to reveal suggestion</summary>

I propose that we need to understand the **problem** before we can design the **solution**. So we have to ask Bob what he wants or how he measures growth first and second what Alice already provided us with.

</details>

So first let us ask Bob what he means by growth. “Well, the size and the number,” he answers. What does this mean for us? It means we have to find a way to calculate or visualize the change in the number of cells and in the area they occupy in the images. To visualize it we first have to calculate it. If we just consider the number of cells, we want a table describing the number of cells in each Petri dish on each day. So we want to create something like:

| Day | Dish 1 | Dish 2 | Dish 3 | Dish 4 | Dish 5 |
| --- | ------ | ------ | ------ | ------ | ------ | 
|   1 |     12 |     21 |     31 |     15 |     27 |
|   2 |     20 |     39 |     57 |     24 |     62 |
|   3 |     43 |     76 |    112 |     61 |    112 |
|   4 |     87 |    151 |    209 |    119 |    205 |
|   5 |    172 |    299 |    421 |    235 |    398 |
|   6 |    351 |    612 |    871 |    472 |    772 |
|   7 |    721 |   1224 |   1721 |    932 |   1398 |
|   8 |   1404 |   2450 |   3554 |   1791 |   2765 |
|   9 |   2900 |   5011 |   7212 |   3451 |   5132 |
|  10 |   5832 |  10182 |  14781 |   6827 |  10091 |
|  11 |  10915 |  19923 |  28732 |  13001 |  19872 |
|  12 |  19983 |  35871 |  50321 |  25874 |  38762 |

Remember the numbers are made up; they just visualize what we want to achieve.
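To hint at where such a table leads us, here is a small Python sketch that stores the first three rows and computes how much a dish grew from one day to the next. The dictionary layout and the name `growth_factor` are my own choices for illustration; the numbers are copied from the made-up table above.

```python
# Cell counts for the first three days, copied from the made-up table.
# Keys are days, values list the counts for dish 1 to dish 5.
cell_counts = {
    1: [12, 21, 31, 15, 27],
    2: [20, 39, 57, 24, 62],
    3: [43, 76, 112, 61, 112],
}


def growth_factor(counts, day, dish):
    """Return how many times more cells a dish holds on `day` than the day before."""
    today = counts[day][dish - 1]          # dishes are numbered from 1
    yesterday = counts[day - 1][dish - 1]
    return today / yesterday


print(growth_factor(cell_counts, 3, 1))  # dish 1, from day 2 to day 3
```

A factor close to 2 would mean the number of cells roughly doubled within a day.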

So now that we know where we are going, we should ask where we are starting or, in other words, what Alice provided. We take a look at the files Alice sent and see that they are all named similarly:

```
Day_1_dish_1_zoom_3.csv
Day_1_dish_2_zoom_3.csv
Day_1_dish_3_zoom_3.csv
Day_1_dish_4_zoom_3.csv
Day_1_dish_5_zoom_3.csv
Day_2_dish_1_zoom_5.csv
Day_2_dish_1_zoom_3.csv
```

It seems like Alice encoded so-called **meta-data** into the file names. **Meta-data** describe something that is not described in the data itself, in this case the day of recording, the dish that was used and the zoom factor of the microscope. This is rather convenient for us, since such data are often provided in separate files or within the file itself, which requires a little extra effort to combine the information again.
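The meta-data can be pulled out of a file name with a few lines of Python. This is a minimal sketch, assuming every file follows the pattern seen in the listing (`Day_<day>_dish_<dish>_zoom_<zoom>.csv`); the function name `parse_file_name` is hypothetical and not something Alice provided.

```python
from pathlib import Path


def parse_file_name(file_name):
    """Split a name like 'Day_1_dish_2_zoom_3.csv' into its meta-data.

    Assumes the pattern Day_<day>_dish_<dish>_zoom_<zoom>.csv.
    """
    stem = Path(file_name).stem   # drops the '.csv' ending
    parts = stem.split("_")       # ['Day', '1', 'dish', '2', 'zoom', '3']
    return {
        "day": int(parts[1]),
        "dish": int(parts[3]),
        "zoom": int(parts[5]),
    }


print(parse_file_name("Day_2_dish_1_zoom_5.csv"))
```

A file that does not follow the pattern makes this sketch fail with an error instead of returning wrong meta-data, which is usually the behavior you want.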

So now let us take a look into one of the comma-separated-value-files to see what data she managed to extract from his images. Assume we see something like this:

Since I made up the example I have a rough idea what the values are supposed to mean. With this example I wish to show how ambiguously data is often represented and how we can work with it anyway.

So now that we looked at our inputs and our goal we can begin creating a plan on how to get there. It is now the time to exercise some abstraction and impose a structure on our solution, so I want you to get together in groups of two or three people and write a simple set of instructions to get to the table we defined as our goal. The goal is to turn them into code later, so you should stay abstract enough in your abstractions. Please remember “Say precisely enough”.

Once again, there usually is no perfect solution, not only because we are all human and all make mistakes, but because the way you approach a problem mirrors the way you think. The code you write is read by other people, just like the plan you just designed together, so it should follow not only your way of thinking but also theirs. This means that in software there are solutions that are preferred by convention and tradition. They are used because they always were, and they are therefore familiar to most users. This means neither that they are good nor that you have to follow them, just that other people will recognize them and understand them faster.

<details>
  <summary>Click to reveal suggestion</summary>

What we are doing is essentially counting so:

1. For every file we do the following:
	1. We open the file
	2. We figure out what day and dish it is
	3. We create a counter for the number of cells
	4. We create a counter for the area covered by the cells
	5. We ignore the first line
	6. For every line we do the following:
		1. We increase the cell counter
		2. We add the cell area to the cell-area counter
	7. We save the cell-counter
	8. We save the area counter
</details>
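The suggested plan can be sketched in Python as follows. This is only a sketch under assumptions: I assume one header line followed by one line per cell, with the cell area in the second column. The function name `summarize` and the example values are invented, and a small in-memory text stands in for one of Alice's files.

```python
import csv
import io


def summarize(file_like, area_column=1):
    """Count the cells and add up their area for one file.

    Assumes one header line followed by one line per cell, with the
    cell area in column `area_column` (a guess, not Alice's real layout).
    """
    reader = csv.reader(file_like)
    next(reader, None)    # ignore the first line (the header)
    cell_count = 0        # counter for the number of cells
    total_area = 0.0      # counter for the area covered by the cells
    for row in reader:
        cell_count += 1
        total_area += float(row[area_column])
    return cell_count, total_area


# Stand-in for the content of 'Day_1_dish_1_zoom_3.csv'; values are invented.
example = io.StringIO("cell_id,area\n1,10.5\n2,8.0\n3,11.5\n")

results = {}
results[(1, 1)] = summarize(example)   # saved under the key (day, dish)
print(results)
```

For a real file you would replace the `io.StringIO` stand-in with `open(file_name, newline="")` and repeat the call for every file.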

The solution is an algorithm, a set of clear instructions that lead to a defined result. The difference between computer algorithms and the algorithms you were taught when you worked on cell cultures, or mixed chemicals to prove that something existed within a cell, is the participation of the executing party. If I tell you to grab a glass of water and put it on my table, you will grab it and put it upright, with the water still in it, on my table, because you inferred what I wanted. To achieve this you used your human experience and your ability to reason about the world to assume “He wants to drink the water later, so I have to ensure it is not spilled.” Computers cannot do this, because they have no human experience and cannot reason like you. A **computer is a logical machine**, a predictable system. It has, by design, no ability to act outside the strict parameters it was given.

To visualize the difference between the way you reason and the way a computer does, imagine a landscape full of hills and creeks. When you think, you wander through this landscape: you walk up onto a hill and look around, then you explore a little valley, and then you may slowly wander to your final destination. Now imagine a large metal ball roughly 3 meters in size and place it on a slope. What will it do? It will roll down the hill, not sideways, not upwards, just down until it comes to rest. This is how computers reason. If they begin at a certain point in the landscape they always reach the same final solution, the same final state. The commands you give them decide where the hills and valleys are, and the parameters decide where they start. This is why some people tell you that instructions for a computer have to be very detailed, or in our analogy that the landscape has to be very finely crafted, with tiny valleys in the slopes a few centimeters wide. This is not correct. Instructions for a computer **do not need to be detailed**, but **you have to know how the logic will flow**, or in our mental image, how the ball will roll.

Neither your program nor your computer will deceive you, purposefully misunderstand your instructions or arbitrarily deviate; only humans have enough freedom in their thinking to do this. You may have the impression that your program or machine-learning algorithm does these things, but it just follows its logical path. This means that the burden of fault lies entirely with us, who instruct the machines, and in a few rare cases with cosmic rays that might flip a bit. In exchange for this burden we gain control: it is our choice how the logical flows of our programs are organized and how they react to their inputs. A small lab rat may refuse to participate in an experiment or sabotage it, but today even our most powerful computer will do as instructed.

After this little side trip let us return to the algorithm we wrote. The rhetorical question is: was this algorithm made for a computer? The answer is: not really. It is a set of instructions made for us, so that we can teach a computer how to do it.

The first thing this little list gives us is **smaller problems**. “The man who moves a mountain begins by carrying away small stones.” – Confucius. To solve our **big problem** we need to break it up into smaller ones. Then we look at the smaller ones and repeat the process recursively until the solution is either obvious or someone has already solved it. Most problems you encounter have already been solved; unfortunately you do not know which ones, so choosing which problem to focus on is challenging. Since I want to teach you something, I will focus on the problem that is, in my opinion, the most educational at the moment.

## How computers work

The first problem I would like to discuss is increasing a counter, because it teaches us a lot about computers. So let us play bad news, good news and terrible news. Bad news: “This will be rather theoretical and more difficult to understand.” Good news: “In almost all cases it does not matter.” Terrible news: “When it matters it will cost you between a few hours and a paper.” So what am I talking about?

I am talking about data representation, or how computers “remember” things. You may have heard that computers are all “ones and zeros”. This refers to the way memory works in a computer. Remember, the people who built them are always trying to find the simplest solution. So what is the most basic, simplest thing to remember? The answer they arrived at was truth. So the memory of a computer is based on a cell that is either filled or empty, “true” or “false”. Everything is built from these cells. Your phone number, your program and the pictures on your phone are all saved in such cells. This is kind of a hassle, since it is difficult to figure out what each cell actually means. Therefore there are some conventions for interpreting them.

Probably the simplest thing to memorize after a True/False or boolean value is a number. You all know decimal numbers like “19”, “15” or “9”. If you recall your earlier education you should remember that the value of a number depends on its digits and their positions. So “19” means calculate `1*10 + 9*1`. This becomes relevant because there are simpler systems. If you wish to reduce the number of characters to remember, you can drop the digit “9”, keep only “0” to “8” and read “15” as `1*9 + 5*1`. You can continue this process until you are left with two characters, “1” and “0”. In theory you could reduce further, but then you no longer have an elegant way to write larger numbers. Since this system has only two characters it is called binary. Now, if you paid attention, the cells have two states, “filled” and “empty”, and we have two digits, “1” and “0”. We can therefore use the cells to write binary numbers like “1001”, which translates to `1 * (2*2*2) + 0 * (2*2) + 0 * 2 + 1 * 1`, or 9 in decimal.
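The positional reading described above can be written down as a short Python sketch. The function name `positional_value` is my own; Python's built-in `int(text, base)` and `bin(number)` do the same job and are shown for comparison.

```python
def positional_value(digits, base):
    """Read a string of digits as a number in the given base."""
    value = 0
    # The rightmost digit counts once, the next one `base` times,
    # the next one `base * base` times, and so on.
    for position, digit in enumerate(reversed(digits)):
        value += int(digit) * base ** position
    return value


print(positional_value("19", 10))    # 1*10 + 9*1 = 19
print(positional_value("15", 9))     # 1*9 + 5*1 = 14
print(positional_value("1001", 2))   # 1*8 + 0*4 + 0*2 + 1*1 = 9

# Python's built-ins agree:
print(int("1001", 2))   # 9
print(bin(9))           # 0b1001
```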

So a computer saves numbers as binary numbers made up of ones and zeros. There is a wrinkle, however. Since a computer is a real physical object, it can only save a limited number of ones and zeros, so we have to squeeze multiple numbers into the same memory. How do we do this? The simple answer: we write them one after another and remember how long they are. By convention the length is a multiple of 8, so we have numbers that are 8 cells, 16 cells, 32 cells or 64 cells long. There are also longer numbers, but they are uncommon. There are advantages to choosing 8, like its proximity to 10 and the fact that 8 digits in binary correspond to two digits in hexadecimal numbers, but these are not relevant now.
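What a fixed number of cells means for a counter can be sketched in Python. Python's own integers grow as needed, so the sketch simulates the limit with a remainder operation; the function name `increment_fixed_width` is hypothetical.

```python
def increment_fixed_width(value, cells=8):
    """Increase a counter that is stored in a fixed number of memory cells.

    Python integers have no fixed length, so the limit is simulated:
    once all cells are filled, the counter starts again at zero.
    """
    return (value + 1) % (2 ** cells)


print(format(255, "08b"))           # 11111111, all 8 cells are filled
print(increment_fixed_width(254))   # 255
print(increment_fixed_width(255))   # 0, the ninth digit does not fit
print((255).bit_length())           # 8 cells are enough for 255
print((256).bit_length())           # 256 would need 9 cells
```

This wrap-around is one of the rare cases where data representation matters: a counter that is too short silently starts over instead of reporting an error.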

Considering that this was a lot of rather dense information, I suggest you take a break and let the new information settle. Maybe you have some questions that come up later and that you would like to have answered. Let us attend to them before we dive deeper into the inner workings of a computer.