# Programming Assignment 01: A Solid Start to Programming

<h1 style="position: absolute; display: flex; flex-grow: 0; flex-shrink: 0; flex-direction: row-reverse; top: 60px;right: 30px; margin: 0; border: 0">
    <style>
        .markdown {width:100%; position: relative}
        article { position: relative }
    </style>
    <img src="https://gitlab.tudelft.nl/mude/public/-/raw/main/tu-logo/TU_P1_full-color.png" style="width:100px; height: auto; margin: 0"/>
    <img src="https://gitlab.tudelft.nl/mude/public/-/raw/main/mude-logo/MUDE_Logo-small.png" style="width:100px; height: auto; margin: 0"/>
</h1>
<h2 style="height: 10px">
</h2>

*[CEGM1000 MUDE](http://mude.citg.tudelft.nl/): Week 1.1. Due: Thursday, Sep 7, 2023.*

## Overview

Topics in this assignment include:

1. [Importing a few commonly used Python packages and checking that they are up-to-date](#Task-1:-Set-up-Python-Packages)
2. [Confirming that you set up your **working directory** properly](#Task-2:-Getting-our-files-sorted-and-imported)
3. [Python code: investigating data](#Task-3:-Investigating-the-Data)
4. [Python code: visualizing the data](#Task-4:-Visualizing-the-Data)
5. [Golden Rule 1: Use Descriptive Names](#Task-5:-Golden-Rule-No.-1)
6. [Markdown: get your point across clearly](#Task-6:-Markdown-and-self-expression)


This assignment will probably be easy if you have used Python before. If you have little to no Python experience, you should refer to the [online textbook](http://mude.citg.tudelft.nl/book) and online course [Python for Engineers](https://tudelft-citg.github.io/learn-python/) when you encounter topics that are unfamiliar to you.

We encourage you to work together with a classmate to complete this assignment, but please _make sure you write the code and execute the notebook cells yourself on your own computer!_ This will get you more comfortable with Python and Jupyter notebooks sooner.

Prior to attempting this assignment, you should have completed all of the [Getting Started](http://mude.citg.tudelft.nl/software/getting_started/) steps on the software page of the course website.

This week you are not required to submit the assignment anywhere, and it will not be part of your project portfolio grade. Where necessary there are feedback tools incorporated into the notebook to check your answers.

## Task 1: Set up Python Packages

As described in the MUDE textbook, we often use **packages** (written in the Python language) to extend the functionality that comes with the base installation of Python. For this assignment we will need the following packages:

| Package | Conventional alias | Why do we need it? |
| :---: |:---: |:---: |
| `numpy` | `np` | Numerical computations, especially with vectors and arrays. |
| `matplotlib.pyplot` | `plt` | Making plots. |
| `mude.week1` | `mude` | Makes it easier for you to learn programming in MUDE, and will help check your answers. |

Remember that you should have already installed these packages with `conda` or `pip` from a command line interface (terminal) on your computer: refer to the [course website](https://mude.citg.tudelft.nl/software/packages) for instructions.


<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px">
<p>
<b>Task 1.1:</b>   
Import the required Python packages as specified in the table above.
</p>
</div>

<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px">
<p>
<b>Task 1.2:</b>   
Verify that you imported the package properly, and also installed a suitable version by executing the following method: <code>mude.check_environment</code>.
</p>
</div>

In [None]:
mude.check_environment()

## Task 2: Getting our files sorted and imported

This programming assignment will create our very first MUDE model, based on a small data set `data.csv`. The cell below has been prepared for you and will import the data into the notebook environment, then save it as two Python objects, `data_x` and `data_y`.

<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px">
<p>
<b>Task 2.1:</b>   
Import the data by running the cell below. If it does not work, you will probably need to visit Task 2.2; the code is also explained step-by-step further below.
</p>
</div>

In [None]:
data_x, data_y = np.genfromtxt("auxiliary_files/data.csv", skip_header=1, delimiter=";").T

Did it work? Unless you followed the installation instructions carefully, you probably got an error when running the previous cell. That's OK: try executing the method `mude.help_task_2`.

<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px">
<p>
<b>Task 2.2:</b>   
Troubleshoot Task 2.1 with <code>mude.help_task_2</code> until you properly import the data. You don't need to understand what this function is doing; it's just something we made to help you check your work.
</p>
</div>

In [None]:
mude.help_task_2(globals())

Confused by the line of code importing the data above? Let's break it down!
* using the method `genfromtxt` in the `numpy` package,
* access the contents of file `data.csv`...
* ...which are separated by the `;` delimiter,
* skip the first row (header),
* take the transpose (`.T`)
* then save the output in two variables, `data_x` and `data_y`

Saving the output into two variables uses a handy Python feature called tuple unpacking (there's a note about this later on). For now, notice that an extra line would have been needed if we did not have this feature:
```
data = np.genfromtxt(...).T  # a NumPy array with two rows (x and y) after the transpose
data_x, data_y = data        # unpacking splits the array into its two rows
```
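To experiment with these steps without touching the course file, here is a minimal sketch that reads a small in-memory text instead of `data.csv`. The contents of `csv_text` and the names `table`, `example_x` and `example_y` are made up for illustration; only `np.genfromtxt` with its `skip_header` and `delimiter` arguments comes from the cell above.

```python
import io

import numpy as np

# Hypothetical stand-in for data.csv: one header row, then x;y pairs.
csv_text = "x;y\n1.0;2.1\n2.0;3.9\n3.0;6.2\n"

# Read the text as if it were a file: skip the header, split on ';'.
table = np.genfromtxt(io.StringIO(csv_text), skip_header=1, delimiter=";")
print(table.shape)  # one row per data point, one column per variable

# Transposing turns the two columns into two rows,
# which can then be unpacked into two separate arrays.
example_x, example_y = table.T
print(example_x)
print(example_y)
```

Printing `table.shape` before and after the transpose is a quick way to see why `.T` is needed before unpacking.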

**Note:** many of you have probably used the package `pandas` before, and might be wondering why we didn't use it here. `pandas` is a great tool for data analysis, as it can quickly load in tabular data, automatically identifies which delimiters are used, skips header rows, and much more. On the other hand, it can be a bit more difficult to use, and in this assignment we are only using the data for very simple computations. So for now, be patient: we will use `pandas` later in Q1!

## Task 3: Investigating the Data

You've now successfully imported the data from `data.csv`, but what was inside it? What data do we have?


<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px">
<p>
<b>Task 3.1:</b>   
Find the file in your file explorer or IDE and open it (you can use Excel, but it's easier to just open it in a text editor; if you aren't sure how, ask a colleague or TA), then answer the questions on the cards below.
</p>
</div>

In [2]:
%%html
<iframe
  src="https://tudelft.h5p.com/content/1292050206829217497/embed"
  height="415"
  width="100%"
>
</iframe>

Since we've already imported the data into our notebook, we could have figured this out without opening the `csv` file. This is easy to do since NumPy returns "NumPy arrays" to us. These arrays are **objects**. Objects are simply wrappers around some data (e.g. array size, data inside array, etc.) and functions (e.g. calculate the mean of the array). The data are the object's _attributes_, and the functions are its _methods_. In Python everything is an object, including constructs such as numbers, strings or lists (for more information you could read [this blog](https://realpython.com/python-mutable-vs-immutable-types)). When writing code you'll often make custom objects by using classes. Classes like `np.ndarray` (N-dimensional array) are blueprints for objects, but need to be "instantiated" to create specific objects called _instances_. That's why there's only one class `np.ndarray`, but you can make arrays of all shapes and sizes which are independent of each other. We will learn more about classes and objects later in the semester; for now you should focus on the fact that as engineers we are constantly working with Python objects, which always have a set of attributes and methods that can make our lives easier.

Below are some attributes of ndarrays you can test with `data_x` and `data_y`:

- `instance.shape`
- `instance.size`

`shape` will give us the dimensions of the ndarray, whereas `size` tells us about the number of elements.
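As a small illustration of the difference, the sketch below uses two made-up arrays (not the assignment data); the names `vector` and `matrix` are chosen only for this example.

```python
import numpy as np

# Illustrative arrays (not the assignment data).
vector = np.arange(5)       # 1-D array with 5 elements
matrix = np.zeros((3, 4))   # 2-D array with 3 rows and 4 columns

# shape is a tuple with one entry per dimension;
# size is the total number of elements.
print(vector.shape, vector.size)
print(matrix.shape, matrix.size)
```

Note that neither attribute is followed by parentheses: they are looked up, not called.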


<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px">
<p>
<b>Task 3.2:</b>   
Print the shape and size of our data. Does the output match your expectations based on your investigation of the csv file?
</p>
</div>

In [None]:
print(f"The shape of our data is: {data_y.FILL_IN_ATTRIBUTE_HERE}")
print(f"The size of our data is: {data_y.FILL_IN_ATTRIBUTE_HERE}")

Now that we have investigated some _attributes_ of ndarray, let's use _methods_ of the objects: we can try and compute the mean and standard deviation of the data we imported. We'll need to use methods of `data_y`, but how can we find what they are? Good package documentation always lists this information for the objects that it provides; that of ndarrays can be found [here---take a look!](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html) (you don't need to understand everything on this page right now, as long as you recognize that there are a _lot_ of attributes and methods for these objects!). Here is a suggestion for how you can approach this:
* If in doubt about the type of your data (i.e., what kind of object it is), try inspecting it using `type(...)`
* As you might have seen on the Documentation page, you can find the mean using the method `.mean()`---pretty logical, right?
* See if you can compute the standard deviation (*Hint*: sometimes it's abbreviated as "std")

Some of these suggestions have been typed in the cell below for you.

Another way to see all the attributes and methods available in an object is to use the `dir` function. This gives a lot of information you don't need to deal with, however, such as internal data or methods. You will also see strange methods that start and end with 2 underscores; those are called "dunder" methods. An example is `__add__`, a special dunder method that gets called when adding two instances together, which allows you to implement custom addition behaviour for your class instances. Don't worry about understanding this now, we will return to it later in the semester. If you don't want to see the output of `dir()` any more, simply comment out the command by placing a `# ` at the beginning of the line and rerunning the cell.
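If you are curious, here is a minimal sketch of a dunder method in action. The class `Money` and its attribute `euros` are hypothetical and exist only for this illustration; they are not part of NumPy or the `mude` package.

```python
class Money:
    """Hypothetical class used only to illustrate the __add__ dunder method."""

    def __init__(self, euros):
        self.euros = euros

    def __add__(self, other):
        # Python calls this method when it evaluates `self + other`.
        return Money(self.euros + other.euros)


total = Money(2) + Money(3)
print(total.euros)

# dir() lists the dunder methods as well as the regular attributes.
print("__add__" in dir(total))

# Filtering out names that start with an underscore leaves the "public" ones.
public_names = [name for name in dir(total) if not name.startswith("_")]
print(public_names)
```

The same filtering trick can make the output of `dir(data_y)` much shorter and easier to scan.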

_Note: you aren't expected to completely understand some of the more computer science-focused aspects of this notebook just yet, for example, the dir function, dunder method or tuple unpacking. We will be revisiting these concepts frequently throughout the semester to help you slowly grow familiar with them and increase your understanding of Python and programming in general._ 

<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px">
<p>
<b>Task 3.3:</b>   
Complete the cell below to compute the standard deviation, and play with the other attributes and methods of the ndarray! Remember that, unlike attributes, methods need parentheses at the end to be evaluated.
</p>
</div>

In [None]:
print(f"Type of data_y: {type(data_y)}")

print(f"Mean of data_y: {data_y.mean()}")

print(f"Standard deviation of data_y: {data_y.FILL_IN_METHOD_HERE}")

print(dir(data_y))

## Task 4: Visualizing the Data

Now that we know what kind of data we have, we want to plot it and find a **model** to represent the observed behaviour. This has been set up in the cells below. See how simple it is to make a plot with Python?

In [None]:
plt.plot(data_x, data_y, 'ok')

Next we want to fit a line to the data...



<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px">
<p>
<b>Task 4.1:</b>   
Without reading further, study the code in the cell below. Can you tell what is happening? Try writing down what each line does and compare with a classmate. Then read through the explanation provided.
</p>
</div>

In [None]:
A = np.array([data_x, np.ones(len(data_x))]).T

[m, b], [rs], _, _ = np.linalg.lstsq(A, data_y, rcond=None)

print(f"y = {m} * x + {b}")

R_squared = 1 - rs / np.sum((data_y - np.mean(data_y))**2)
print(f"R²: {R_squared}")

<div style="background-color:#FAE99E; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px">
<p>
<b>Solution:</b>   
Make an honest attempt at Task 4.1 before reading further.
</p>
</div>

So what is this code doing? Let's go through it line by line:
1. First we set up a matrix `A`: its first column contains the `data_x` vector and its second column is all ones, with the same length. If `data_x` has 5 data points, it would look like this:
$$
\begin{equation}
\begin{pmatrix} x_1 & 1 \\ x_2 & 1 \\ x_3 & 1 \\ x_4 & 1 \\ x_5 & 1 \end{pmatrix}
    \tag{1}
\end{equation}
$$ 
To see why `A` is set up this way, you'll need to read the [documentation](https://numpy.org/doc/stable/reference/generated/numpy.linalg.lstsq.html). The online documentation also has a useful search feature, so keep it in mind when working with NumPy! Going back to `lstsq`, the documentation states that it solves the matrix equation:
$$
\begin{equation}
Ax = B
    \tag{2}
\end{equation}
$$
We have two columns in our `A` matrix, so `x` will have two rows. We're trying to solve for two things, the slope and intercept of a line, so this checks out. Remember the equation of a line looks like this:
$$
\begin{equation}
y = mx + b 
    \tag{3}
\end{equation}
$$
If we use Eqn. 1 and Eqn. 3 to expand Eqn. 2, we get: 
$$
\begin{pmatrix} x_1 & 1 \\ x_2 & 1 \\ x_3 & 1 \\ x_4 & 1 \\ x_5 & 1 \end{pmatrix}
\begin{pmatrix} m \\ b \end{pmatrix} = 
\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \end{pmatrix}
$$
Take a moment to convince yourself that solving this is equivalent to solving for the line!

2. Next we call the actual function, but `lstsq` returns a lot of information, some of which we don't care about and some of which we want to split up. We can use a technique called *tuple unpacking* to separate all the return values into their own variables (we already used this when importing our data from the csv file!). When a Python function returns multiple values, they are packed as a single tuple, e.g. `(return1, return2, return3)`. Unpacking simply means splitting this up again like this: `r1, r2, r3 = (r1, r2, r3)`. We can split further if one of the return values is itself a tuple or a value that can be unpacked (like a NumPy array), by using square brackets or parentheses: `r1, (nr1, nr2) = (r1, (nr1, nr2))` or `r1, [nr1, nr2] = (r1, (nr1, nr2))`. Finally, if we don't care about a return value, we can use an `_` to ignore it. You still have to include the underscore, even if it's the last item you're throwing away, otherwise Python will think you're trying to unpack the wrong number of values.
3. Here we're using an *f-string* to print a message with the variables embedded in the string. Python offers other ways to format strings with information, but *f-strings* are one of the most flexible and legible.
4. We calculate $R^2$ using the formula:
$$
R^2 = 1 - \frac{\sum{(\hat{y}_i - y_i)^2}}{\sum{(y_i - \bar{y})^2}}
$$
`lstsq` already returns the top expression, the sum of squared residuals, so we only need to calculate the bottom expression. NumPy arrays define special behaviour for the `-`, `+`, `@`, etc. operators (remember those dunder methods?). In this case, subtracting a scalar from the vector will subtract it element-wise. Squaring is also element-wise. Take some time to visualize the transformations on the NumPy array, and how they result in the bottom expression!
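The tuple unpacking described in step 2 can be tried out on a much smaller function. The sketch below uses a hypothetical helper, `summarize`, written only for this illustration; it mimics the way `lstsq` returns several values at once, one of which can itself be unpacked.

```python
import numpy as np


def summarize(values):
    """Hypothetical helper: return three things at once, packed as a tuple."""
    values = np.asarray(values, dtype=float)
    extremes = np.array([values.min(), values.max()])
    return values.sum(), extremes, len(values)


# Without unpacking, everything arrives as one tuple.
packed = summarize([3, 1, 2])
print(type(packed), len(packed))

# With unpacking: split the nested array with square brackets
# and ignore the last return value with an underscore.
total, [smallest, largest], _ = summarize([3, 1, 2])
print(total, smallest, largest)

# Leaving out the underscore gives the wrong number of targets.
try:
    total, [smallest, largest] = summarize([3, 1, 2])
except ValueError as error:
    print(f"Unpacking failed: {error}")
    unpacking_failed = True
```

The last part shows why the underscore placeholders in the `lstsq` call cannot simply be dropped.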

**Interpretation of $R^2$:** mathematically speaking, this metric is the complement of the mean of the squared errors divided by the variance of the observations. In other words, it is the percentage of variance explained by the model; a high number means that the model (in this case a line) accounts for more of the variation in the data, as opposed to other sources, such as randomness. If this concept is not completely clear to you, don't worry: we will revisit it later in the semester.
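To connect the formula to code, here is a minimal sketch that computes both sums by hand and compares the top one with the value returned by `lstsq`. The arrays `x` and `y` are made-up numbers, not the assignment data, and the names `ss_res` and `ss_tot` are chosen only for this example.

```python
import numpy as np

# Made-up, nearly linear data for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Same set-up as in the cell above: a column of x values and a column of ones.
A = np.array([x, np.ones(len(x))]).T
(m, b), residuals, _, _ = np.linalg.lstsq(A, y, rcond=None)

# Top expression: sum of squared differences between the line and the data.
y_hat = m * x + b
ss_res = np.sum((y_hat - y) ** 2)

# Bottom expression: sum of squared differences between the data and its mean.
ss_tot = np.sum((y - np.mean(y)) ** 2)

r_squared = 1 - ss_res / ss_tot
print(f"R² = {r_squared:0.4f}")

# The hand-computed top expression should match what lstsq reports.
print(np.isclose(ss_res, residuals[0]))
```

Because the example data lie close to a straight line, the resulting value is close to 1.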

## Task 5: Golden Rule No. 1

Let's try this again, but this time we'll use a function from the `mude` package that illustrates [Golden Rule 1](https://mude.citg.tudelft.nl/programming/golden_rules.html#rule-1-use-descriptive-names). This function uses names that describe the _meaning_ of their data more clearly.

_Note: this Task is meant to explicitly illustrate the effect of descriptive names in being able to more easily read and understand code; it is not meant to imply that the function `lstsq` is not useful---it is far more powerful than our simple `mude` function!_

<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px">
<p>
<b>Task 5.1:</b>   
Use the function provided to find the slope and intercept of the least squares line through the data. The help function will print out the documentation, which describes its intended usage.
</p>
</div>

In [None]:
help(mude.fit_a_line_to_data)

In [None]:
[slope, intercept] = mude.fit_a_line_to_data(FILL_IN_ARGUMENTS_HERE)
print(f"y = {slope:0.3f} * x + {intercept:0.3f}")

As you can see, the new version is a lot easier to understand! We abstract away a lot of complexity on our end since we don't have to think about matrices. On the other hand, if you wanted to confirm that the method was doing the right computations, you would have to dig into the source code, so there are trade-offs. In general, try and find functions that meet your needs specifically.

Let's see if we can also easily interpret the mathematics implemented in the source code:

```python
def fit_a_line_to_data(x, y):
    """ Fits a line of best fit (using least squares) to data points
    Arguments:
        x (numpy array): x-coordinates of points
        y (numpy array): y-coordinates of points
    """
    
    x_mean, y_mean = np.mean(x), np.mean(y)
    slope = np.sum(np.multiply((x - x_mean), (y - y_mean))) / np.sum(np.square(x - x_mean))
    intercept = y_mean - slope * x_mean
    
    return (slope, intercept)
```
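One way to gain confidence in a function like this without reading every line is to compare it against an independent implementation. The sketch below repeats the source code shown above as a local copy and checks it against `np.polyfit` on made-up data; the arrays `x` and `y` are assumptions for illustration, not the assignment data.

```python
import numpy as np


def fit_a_line_to_data(x, y):
    """Local copy of the function shown above, so this example runs on its own."""
    x_mean, y_mean = np.mean(x), np.mean(y)
    slope = np.sum(np.multiply((x - x_mean), (y - y_mean))) / np.sum(np.square(x - x_mean))
    intercept = y_mean - slope * x_mean
    return (slope, intercept)


# Made-up, nearly linear data for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

slope, intercept = fit_a_line_to_data(x, y)

# np.polyfit with degree 1 returns the coefficients, highest power first.
reference_slope, reference_intercept = np.polyfit(x, y, 1)

print(f"y = {slope:0.3f} * x + {intercept:0.3f}")
print(np.isclose(slope, reference_slope), np.isclose(intercept, reference_intercept))
```

Agreement on one data set is not a proof, but it is a quick sanity check that the formulas implement a least squares line.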

<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px">
<p>
<b>Task 5.2:</b>   
Answer the two questions in the flip card below (you'll need an internet connection to see them---and also an old fashioned pencil and paper for the first one!!!).
</p>
</div>

In [1]:
%%html
<iframe
  src="https://tudelft.h5p.com/content/1292049265499022097/embed"
  height="415"
  width="100%"
>
</iframe>

## Task 6: Markdown and self-expression
Now we are going to use our model of the data (the regression line), and you'll answer some questions.

In [None]:
def line(x):
    return slope * x + intercept

linspace_x = np.linspace(0, np.max(data_x))

plt.plot(data_x, data_y, "ok")
plt.plot(linspace_x, line(linspace_x))
# plt.xlabel("")
# plt.ylabel("")
# plt.title("")
plt.show()

<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px">
<p>
<b>Task 6.1:</b>   
    <p>Complete the following exercises to get comfortable with the code above:</p>
<ol>
    <li> Change the labels on the axes of the graph, and give it a nice title. </li>
    <li> Change the colour of the points from black to blue, and the line colour from blue to red </li>
</ol>
</div>

<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px">
<p>
<b>Task 6.2:</b>   
Predict the y-value of the line at x=5 (you can check the csv to see if it's close). Note that the line isn't perfect, so the data won't match exactly.
</p>
</div>

In [None]:
# TYPE YOUR CODE HERE TO FIND y at x=5

Now we would like to quickly communicate our results; imagine you will be sending this notebook to a classmate or a teacher and you really want them to be able to see the final answer easily, while also being able to understand how you arrived at it. We can do this quite easily using Markdown cells in the notebook (of course, if you haven't realized it yet, this cell you are reading is a Markdown cell).

Markdown is a _markup language_ that makes it possible to write richly-formatted text in Jupyter notebooks (see the course website for more information and examples). We will use Markdown on a weekly basis in our MUDE project reports, so this is a good chance to use it for the first time, if you never have. Use the cell below to write your answer to Q1, using the following formatting tips:
- Write out lists by beginning the line with a hyphen `- my item`, or a number and dot `1. first item`
- You make text **bold** by using `**double asterisk**` and *italics* with `*one asterisk*`.
- `Highlight` code-related words or other important concepts using back-ticks `` `like this` ``
```python
message = "You can also make multi-line code blocks with three back-ticks"
extra_message = "Provide a language after the first back-ticks to get syntax highlighting"
# e.g.
# ```python
# this explanation
# ```
print(message)
print(extra_message)
```

<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px">
<p>
<b>Task 6.3:</b>   
Use the Markdown cell below to document your answers from Tasks 6.1 and 6.2.
</p>
</div>

### Your answer for Q1 in Task 6: include a title

Write some explanation of what you did, as well as the steps:
- step 1
- step 2
- ...

You can `format the command you used` as code. You can also write the answer using LaTeX-style math notation, either inline (like this: $x=5$) or as a separate block, like this:
$$
y = m \cdot x + b
$$

That's it for this week!

**End of notebook.**
<h2 style="height: 60px">
</h2>
<h3 style="position: absolute; display: flex; flex-grow: 0; flex-shrink: 0; flex-direction: row-reverse; bottom: 60px; right: 50px; margin: 0; border: 0">
    <style>
        .markdown {width:100%; position: relative}
        article { position: relative }
    </style>
    <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">
      <img alt="Creative Commons License" style="border-width:0; width:88px; height:auto; padding-top:10px" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" />
    </a>
    <a rel="TU Delft" href="https://www.tudelft.nl/en/ceg">
      <img alt="TU Delft" style="border-width:0; width:100px; height:auto; padding-bottom:0px" src="https://gitlab.tudelft.nl/mude/public/-/raw/main/tu-logo/TU_P1_full-color.png"/>
    </a>
    <a rel="MUDE" href="http://mude.citg.tudelft.nl/">
      <img alt="MUDE" style="border-width:0; width:100px; height:auto; padding-bottom:0px" src="https://gitlab.tudelft.nl/mude/public/-/raw/main/mude-logo/MUDE_Logo-small.png"/>
    </a>
    
</h3>
<span style="font-size: 75%">
&copy; Copyright 2023 <a rel="MUDE Team" href="https://studiegids.tudelft.nl/a101_displayCourse.do?course_id=65595">MUDE Teaching Team</a> TU Delft. This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
</span>