[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Humboldt-WI/ipml/blob/master/tutorial_notebooks/4_preparing_for_machine_learning_tasks.ipynb) 

# Moving to Machine Learning

<hr>
<br>

This notebook continues our journey through Python programming. At this point, you should be familiar with general programming concepts, including variables, data types, and control structures. These concepts recur as we move on to using Python for machine learning and AI. Specifically, the notebook accompanies the lecture on the **Foundations of Machine Learning**.

Key topics:
- Libraries for data handling: Numpy and Pandas
- Hands-on case study: Resale Price Forecasting

# It is all about data
Machine learning is a data-driven field. The two most important libraries for handling data in Python are `Numpy` and `Pandas`. If needed, you can install them by uncommenting and executing the following cell.


In [None]:
# !pip install pandas numpy 

# The Numpy Library
`Numpy` is a powerful library for scientific computing. The core data type is the `numpy.ndarray`, which represents a **tensor**. A tensor is a multi-dimensional array that generalizes scalars (0D), vectors (1D), and matrices (2D) to higher dimensions, used to represent data in mathematics and machine learning. `Numpy` is designed to facilitate fast processing of such higher dimensional data. Let's explore how we can use the `numpy.ndarray` to store and manipulate data.

## Basic data handling 

In [None]:
import numpy as np

# Creating arrays from lists
array_1d = np.array([1, 2, 3, 4, 5, 6])  # 1D array
array_2d = np.array([[1, 2, 3], [4, 5, 6]])  # 2D array
array_3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])  # 3D array

print("We store data as: ", type(array_1d), "\n")
print("1D Array:\n", array_1d)
print("\n2D Array:\n", array_2d)
print("\n3D Array:\n", array_3d)


Arrays support indexing and slicing, just like lists and other Python containers.

In [None]:
# Indexing
print("\nArray indexing:")
print("Element at index 2 in array_1d:", array_1d[2])
print("Element at row 1, column 2 in array_2d:", array_2d[1, 2])
print("Element at [0, 1, 1] in array_3d:", array_3d[0, 1, 1])

# Slicing
print("\nArray slicing:")
print("First 3 elements of array_1d:", array_1d[:3])
print("Last row of array_2d:", array_2d[-1, :])
print("Slice of array_3d:\n", array_3d[:, :, 0])


They also support logical indexing.


In [None]:
# Logical operations
is_even = array_1d % 2 == 0  # we have seen the modulo operation before
print("Boolean mask for even numbers in array_1d:\n", is_even)
print("Filtered even numbers from array_1d:\n", array_1d[is_even])

## Accessing functionality
Furthermore, the `Numpy` library provides a [massive set of functions](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html) to work with data of type `np.ndarray`. Here is an example where we use functions to aggregate data.

In [None]:
# Call Numpy functions on an array
print("Sum of array_1d:", np.sum(array_1d))         # Numpy function to calculate sum
print("Mean of array_2d:", np.mean(array_2d))       # Numpy function to calculate mean
print("Max value in array_3d:", np.max(array_3d))   # Numpy function to calculate max value

Similarly, many functions allow you to reshape or transform your multi-dimensional data. We exemplify this by first reshaping our 1D array into a matrix and then transposing the resulting matrix. 

In [None]:
matrix = array_1d.reshape(3, 2)  # Reshape 1D array to 2D array
print("\nReshaped 1D array:\n", matrix)

# Transpose
matrix_transpose = matrix.T
print("\nTranspose of matrix:\n", matrix_transpose)

Two remarks concerning the programming syntax in the above demo. 

First, note the difference in the Python syntax. Before, we used the syntax `np.some_function(some_ndarray)` to call a function on an `ndarray`. Here, we use the syntax `some_ndarray.some_function(some_arguments)` instead. Both forms are common when working with `Numpy`. They are often interchangeable. Depending on the context, however, one version can be more readable and thus preferable to another. For example, the following lines of code are equivalent but version 1 is considered more readable: 
```Python
np.sum(array_1d)  # Version 1; more readable
array_1d.sum()    # Version 2; equivalent but less readable.
```
Second, the syntax `some_ndarray.T` is often seen in practice, although it is not very readable. It is a commonly used shorthand form for `np.transpose(some_ndarray)`.
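
As a quick check of these two remarks, the following sketch compares the function form and the method form on a small array. The variable names (`demo`, `total_function`, and so on) are chosen for illustration only.

```python
import numpy as np

demo = np.array([[1, 2, 3], [4, 5, 6]])  # illustrative 2x3 array

# Function form and method form compute the same sum
total_function = np.sum(demo)
total_method = demo.sum()
print("Same sum:", total_function == total_method)

# .T is shorthand for np.transpose()
same_transpose = np.array_equal(demo.T, np.transpose(demo))
print("Same transpose:", same_transpose)
print("Shape after transposing:", demo.T.shape)
```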

Let's next practice the use of `Numpy` by solving some programming exercises.

## Exercises


Create two matrices, $A$, and $B$, as follows:

 $$ A = \left( \begin{matrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 10 \end{matrix} \right) \quad
  B = \left( \begin{matrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \end{matrix} \right)  \quad $$

  Store the variables as a `numpy.ndarray`.

In [None]:
# Code to create variables A, B
A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 10]])
B = np.array([[1, 4, 7], [2, 5, 8], [3, 6, 9]])
print("A:\n", A)
print("B:\n", B)

Perform the following operations.

Note that mathematical operators like `*` might not behave in the way you expect or need. For example, the result of $A*B$ depends on how you understand the multiplication sign. It could refer to the matrix product but also to an element-wise multiplication (i.e., the *Hadamard* product). Below, we use the notation $A \cdot B$ to indicate the matrix product. 

  a. Define a variable `a`, set it to some value and calculate $a \cdot A$

In [None]:
a = 4
a * A

  b. Calculate $A \cdot B$

In [None]:
A * B  # element-wise (Hadamard) product, not the matrix product

In [None]:
np.dot(A, B)  # matrix product
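
To make the difference between the element-wise product and the matrix product concrete, here is a minimal sketch with two small matrices. The matrices `X` and `Y` are made up for illustration and are not part of the exercise.

```python
import numpy as np

X = np.array([[1, 2], [3, 4]])
Y = np.array([[0, 1], [1, 0]])

elementwise = X * Y      # multiplies entries position by position
matrix_product = X @ Y   # matrix product; np.dot(X, Y) gives the same for 2D arrays

print("Element-wise product:\n", elementwise)
print("Matrix product:\n", matrix_product)
```

For 2D arrays, the `@` operator and `np.dot()` return the same result, so either form can be used.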

c. Calculate the inverse of matrix $A$ and store the result in a variable $invA$. Be assured that `Numpy` knows a function to calculate the inverse of a matrix.

d. Multiply $A$ and $invA$ and verify that the result is the identity matrix (i.e. only 1s on the diagonal). You'll probably find that it isn't, because computers usually make very small rounding errors when handling real numbers. 

e. To further investigate these rounding errors, create an identity matrix of suitable size using the `Numpy` function `eye()`. Store the result in a variable `I`. <br>Then compute $ I - A \cdot A^{-1}$. 

f. Fill the first row of matrix $B$ with ones.

g. Access the second element in the third row of $A$ and the first element in the second row of $B$, and compute their product.

h. Multiply the first row of $A$ and the third column of $B$

i. Access the elements of $B$ that are greater than 1 (without looking up their position manually)

j. Access the elements of $A$ in the second column, for which the values in the first column are greater than or equal to 4.
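
If you get stuck, the following sketch illustrates the relevant building blocks on a different, made-up matrix `M`, so the exercises themselves remain yours to solve. It assumes `np.linalg.inv()` for the inverse and uses `np.allclose()` to compare matrices up to small rounding errors.

```python
import numpy as np

M = np.array([[4.0, 7.0], [2.0, 6.0]])  # illustrative matrix, not part of the exercise

invM = np.linalg.inv(M)      # inverse of M
I = np.eye(2)                # 2x2 identity matrix
difference = I - M @ invM    # entries are zero or tiny rounding errors

print("Difference to identity:\n", difference)
print("Equal up to rounding:", np.allclose(M @ invM, I))

# Logical indexing: second-column values of rows whose first column is >= 4
selected = M[M[:, 0] >= 4, 1]
print("Selected values:", selected)
```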

# The Pandas Library
`Pandas` is the *go-to* library when it comes to storing tabular data in Python. Like `Numpy`, it provides a core data type - actually two - and a ton of functionality to work with the corresponding data. The first core data type is the `DataFrame`. Think of it as an Excel spreadsheet or a table in a relational database. The second core type is the `Series`, which you can think of as a single column of a table (i.e., a `DataFrame`). 

## Creating a DataFrame
It is possible to create a `DataFrame` on-the-fly. Recall our first example of the `numpy.ndarray`. There, we created the data by converting a `list`: 
```Python
array_1d = np.array([1, 2, 3, 4, 5, 6])  # 1D array
```
A similar approach would work for the type `DataFrame`.

In [None]:
import pandas as pd

# List of values
some_data = [10, 20, 30, 40, 50]

# Create a DataFrame
df = pd.DataFrame(data=some_data, columns=['Header'])

print(df)

Several things to note here:
- We specified the column header in our table by setting the argument `columns`. The code would execute without setting this argument. Try it out to see why it is useful to specify a column header.
- In the printout, we see the actual data, our column header, and an index (i.e., the leftmost column without header). The index is set by default to a consecutive number. This mimics the behavior you already know from `list` and `numpy.ndarray`. We can access data using an index. In Pandas, we can manually adjust the index if needed. We will see some examples in later parts of the course. 
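
The two points above can be sketched as follows. The index labels `'a'` to `'e'` and the variable names are made up for illustration; the `index` argument and the `.loc` accessor are standard `Pandas` features.

```python
import pandas as pd

some_data = [10, 20, 30, 40, 50]

# Without the columns argument, Pandas falls back to a numeric column label
df_no_header = pd.DataFrame(data=some_data)
print(df_no_header.columns.tolist())

# A custom index replaces the default consecutive numbers
df_custom_index = pd.DataFrame(data=some_data, columns=['Header'],
                               index=['a', 'b', 'c', 'd', 'e'])
value_at_c = df_custom_index.loc['c', 'Header']  # access by index label and column name
print(value_at_c)
```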

If creating a `DataFrame` on-the-fly, it is actually more common to use a `dictionary` than a `list` because the key=value paradigm of dictionaries naturally provides column headers. Here is an example.

In [None]:
# Generate a DataFrame from a dictionary
some_dic = {'Name': ['Peter', 'Selina', 'Bruce', 'Natascha', 'Clark', 'Diana'],
        'Age': [22, 21, 25, 22, 23, 22]
}
df = pd.DataFrame(some_dic)
df  # Display the DataFrame. Not using print() is preferable for Jupyter notebooks. The result just looks nicer

## Using DataFrames
We could spend an entire session and more on exploring the functionality of the `Pandas` library, the `DataFrame`, and the `Series`. Generally speaking, many concepts that you know from `Numpy` recur. Examples include indexing and slicing but also the way in which you call functions on data stored in a `DataFrame`. To give just one example, try out the following code:
```Python
df["Age"].mean()
```
It is fairly obvious what this code does. More importantly, the example shows the syntactical similarities between `Pandas` and `Numpy`. Specifically, we first index a column by using its name, and we then call a function using dot-notation on the resulting data, that is, all the age values stored in the `DataFrame`. Again, we will see many more examples of the functionality in the `Pandas` library as we go along.
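
To illustrate the similarity further, here is a small sketch showing that logical indexing, which we used for `Numpy` arrays, works in much the same way for a `DataFrame`. The `people` table is a shortened, illustrative copy of the data above.

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Peter', 'Selina', 'Bruce'],
                       'Age': [22, 21, 25]})

ages = people['Age']           # indexing a column yields a Series
is_older = people['Age'] > 21  # Boolean mask, just like with Numpy arrays
older = people[is_older]       # keep only the rows where the mask is True

print("Oldest age:", ages.max())
print(older)
```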

In [None]:
# Calculate the mean Age over the DataFrame df



## Loading data
The most important use case of `Pandas` for now is to load data sets that are already stored on a hard disk, in an Excel file, available on the web, etc. We demonstrate this operation for text data stored in *csv* format (for comma separated values). You can load data in this format using a function `read_csv()`. To load data in another common format, `Pandas` knows many more `read_xyz()` functions that work similarly. 
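
To see what `read_csv()` does without relying on a file or an internet connection, consider this minimal sketch. It reads csv-formatted text from memory using `io.StringIO`. The column names and values are made up for illustration.

```python
import io
import pandas as pd

# A tiny csv "file" held in memory: first line is the header, commas separate values
csv_text = "Brand,Retail price\nA,1000\nB,1500\n"

demo_data = pd.read_csv(io.StringIO(csv_text))
print(demo_data)
print(demo_data.shape)
```

The first line of the text becomes the column header, and every further line becomes one row of the `DataFrame`.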

Execute the following code, which loads a demo data set from our GitHub repository.

In [None]:
import pandas as pd
url = 'https://raw.githubusercontent.com/Humboldt-WI/IPML/main/data/resale_price_dataset.csv'

resale_data = pd.read_csv(url)

# Display a preview of the data
resale_data

Recall the resale price forecasting use case from our last lecture. 

<img src="https://raw.githubusercontent.com/Humboldt-WI/demopy/main/model_based_resale_price_forecasting.png" width="680" height="400" alt="Resale Price Forecasting ">

The data we just loaded is a synthetic data set, which we will use in the following exercises to demonstrate a resale price forecasting use case. Take a little time to familiarize yourself with the data. You already saw how you can obtain a preview of the data stored in a `DataFrame`. This is also the purpose of the `Pandas` functions `.head()` and `.tail()`. Further useful functions to obtain a first impression of a new data set include:
- `.shape` (no brackets)
- `.info()`
- `.describe()` 
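
As a rough guide to what these functions return, here is a sketch on a tiny made-up table. The `toy` data and its column names are assumptions for illustration and unrelated to the resale data.

```python
import pandas as pd

toy = pd.DataFrame({'price': [100, 200, 300, 400],
                    'months': [12, 24, 36, 48]})

print(toy.head(2))     # first 2 rows
print(toy.tail(1))     # last row
print(toy.shape)       # (number of rows, number of columns); an attribute, so no brackets
toy.info()             # column names, data types, and non-null counts
summary = toy.describe()  # count, mean, std, min, quartiles, max per numeric column
print(summary)
```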

Use the following code cells to play with these functions.

In [None]:
# Display the first 5 rows of the DataFrame using the head() function

In [None]:
# Display the last 10 rows of the DataFrame using the tail() function

In [None]:
# Display the shape of the DataFrame using the shape attribute  


In [None]:
# Produce a summary of the DataFrame using the info() function  

In [None]:
# Call the describe() function on the DataFrame to get a statistical summary of the data    

## Exercises

### Judgmental vs Statistical Forecasting
We complete this session with an exercise related to our lecture. Recall that we discussed the differences between a judgmental and a statistical approach to predicting resale prices of used laptops. The demo data set provides information on observed resale prices in the rightmost column `Observed resale price`. To access this data, you can index the `DataFrame` like so:
```Python
resale_data["Observed resale price"]
``` 

In a similar way, you can access the laptop's original list price, which is available in the column `Retail price`, and its age, which you find in the column `Actual Lease Duration (months)`. 

**Goal**
Implement a rule-based forecasting model. Your model should implement the business rule:
- Resale price = 50% of the retail price if the actual lease duration > 24 months.
- Resale price = 65% of the retail price if the actual lease duration <= 24 months.


**Task 1**
Write a custom function `business_rule(row)`. Assume this function receives as input a single row of the `DataFrame`. A single row represents one specific laptop. To implement the business rule, you need to use a condition and indexing. For example, to obtain the retail price of the laptop, simply type `row["Retail price"]`. This is how you can use indexing for `DataFrames`. Also, you can perform mathematical operations like 
```Python
resale_data["Retail price"]*0.5
``` 
using indexing.

Your function should return a scalar value representing the rule-based forecast for the specific laptop that the function received as input, which will be either 50% or 65% of its original list price depending on the actual lease duration.
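
The difference between indexing a whole column and indexing a single row can be sketched as follows. The `toy` table and its prices are made up; only the column name `Retail price` is taken from the data set.

```python
import pandas as pd

toy = pd.DataFrame({'Retail price': [1000.0, 1500.0]})  # made-up prices

# Indexing the DataFrame yields a whole column; the multiplication applies to every value
half_price = toy['Retail price'] * 0.5
print(half_price)

# Indexing a single row yields one scalar value
first_row = toy.iloc[0]                  # one row, as it would be passed to your function
first_price = first_row['Retail price']
print(first_price)
```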

**Task 2**
Call your custom function for every laptop, that is, every row of the `DataFrame`. You could write a loop for that, but there is a better way. `Pandas` provides a function `.apply()`. You can pass this function the name of another function as input. The `.apply()` function will then call that other function for each row of the `DataFrame`, which is exactly what we need. Hence, to solve task 2, you can write: 
```Python
resale_data['Rule-based Forecast'] = resale_data.apply(business_rule, axis=1)  # axis=1 says that we want to apply the function to every row, not every column
```
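
To show how `.apply()` with `axis=1` works without giving away the solution, here is a sketch on unrelated toy data. The table `toy`, the function `discounted_total`, and the discount rule are all hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({'price': [100.0, 200.0],
                    'quantity': [3, 12]})

def discounted_total(row):
    """Hypothetical rule: 10% discount on the total for 10 or more items."""
    total = row['price'] * row['quantity']
    if row['quantity'] >= 10:
        return total * 0.9
    return total

# axis=1 passes one row at a time to discounted_total
toy['total'] = toy.apply(discounted_total, axis=1)
print(toy)
```

The pattern is the same as in the task: the function receives one row, uses indexing and a condition, and returns a single value, which `.apply()` collects into a new column.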

**Task 3**
Compute the difference between the actual resale price of a laptop and the rule-based forecast. Provided you solved the previous tasks, this is easy. Simply add another column to the `DataFrame` as already illustrated for task 2, and compute this column as 
```Python
... resale_data['Observed resale price'] - resale_data['Rule-based Forecast']
```
What would be a better name for the difference you just computed?

**Task 4**
Last, execute the following code to create a plot of your calculated differences between observed and predicted resale prices:
```Python
resale_data["Name_of_your_column"].hist()
```

In [None]:
# Code your solution here
