In [2]:
import numpy as np
import pandas as pd

def printline(n):
    print()
    print(f"{n} -------------------------------")
    print()

# Numpy

### Introduction

Numpy, short for Numerical Python, serves 1 core purpose in data science applications: it provides fast, memory-efficient n-dimensional arrays implemented in C, with support for many essential operations on these arrays.

Regular python lists are quite good, but every item in them is a separate python object, which costs memory and time and can hurt performance in large applications. The efficiency gains of numpy arrays are not immediately visible with a small array, but they become substantial for large and multidimensional arrays.
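As a rough illustration of the difference, here is a minimal sketch that computes the same sum of squares with a python list and with an ndarray and times both. The size `n` and the repeat count are arbitrary choices for this example, and the timings will vary from machine to machine.

```python
import timeit

import numpy as np

n = 100_000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# both approaches compute the same sum of squares
list_result = sum(v * v for v in py_list)
array_result = int((np_array * np_array).sum())
print(list_result == array_result)

# time 10 repetitions of each approach (numbers depend on your machine)
list_time = timeit.timeit(lambda: sum(v * v for v in py_list), number=10)
array_time = timeit.timeit(lambda: (np_array * np_array).sum(), number=10)
print(f"list: {list_time:.4f}s, ndarray: {array_time:.4f}s")
```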

### Basic Numpy Constructs

Numpy offers a single basic data structure: ```np.ndarray```, usually created with the ```np.array``` function.
It is written in C and stores its items in a compact, typed buffer, which makes it far more efficient than a python list.

Most importantly, a freshly created numpy array is stored contiguously in memory: there are no gaps in the memory storage, so you could access adjacent entries in the computer's memory and see the items side by side. For an array of shape ```(d0, d1, d2)```, an n-dimensional index such as ```x[i][j][k]``` gives the address ```[address of x] + (i * d1 * d2 + j * d2 + k) * sizeof(item)```. This makes them easily copyable and easy to operate on.
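The index-to-address mapping can be checked with a small sketch. It assumes a row-major (C-order) array, which is what `np.array` and `np.arange` produce by default; the shape and the index `(i, j, k)` are arbitrary example values.

```python
import numpy as np

x = np.arange(24, dtype=np.int32).reshape(2, 3, 4)
i, j, k = 1, 2, 3
d0, d1, d2 = x.shape

# position of x[i, j, k] in the flat, contiguous buffer
flat_offset = i * d1 * d2 + j * d2 + k
byte_offset = flat_offset * x.itemsize

# numpy stores the per-axis step sizes (in bytes) in x.strides
strides_offset = i * x.strides[0] + j * x.strides[1] + k * x.strides[2]

print(x.strides)                            # (48, 16, 4)
print(byte_offset, strides_offset)          # 92 92
print(x[i, j, k], x.ravel()[flat_offset])   # 23 23
```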

There are other numpy data structures such as ```np.matrix``` (which the numpy documentation no longer recommends using), but ```np.ndarray``` will be most of what you see.

All the numpy operations in this tutorial are specifically for dealing with numpy arrays. Numpy arrays will now be referenced as ndarrays or just arrays.

In [None]:
# python list- slow and inefficient for large arrays
x = [[1, 2, 3], [1, 2, 3]]
printline(1)
print(x)
print(type(x))

# numpy array- fast, efficient for all sizes of arrays
# note on dtype- the default integer type depends on the platform (commonly
# int64), so you should ALWAYS ALWAYS specify it
x = np.array([[1, 2, 3], [1, 2, 3]], dtype=np.int32)
printline(2)
print(x)
print(type(x))

# numpy provides lots more support for multidimensional arrays (these work for
# single dimensional arrays too)
print(x.shape)
print(x.dtype)
print(x.ndim)

# indexing syntax
print(x[0, 0])
print(x[0][0])

# note: the subarrays are also numpy objects now!
print(x[0])
print(type(x[0]))

# some conventional python constructs work with ndarrays:
print(len(x))
for item in x:
    print(item)

### Essential Fields

There are a couple of essential ndarray fields:

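- `dtype`: type of the items the ndarray contains. dtype can be any of the following:
    - predefined type from the numpy library (e.g., `np.int32`)
    - predefined native python type, which numpy maps to one of its own types (e.g., `int`, `float`)
    - custom type defined by the user (e.g., user-defined class), which is stored with the generic `object` dtype

  These types are strictly enforced for every item, unlike in normal python lists.
- `ndim`: the number of dimensions of the ndarray. It is at least 1 for any array built from a list; only a single scalar wrapped in an array (e.g., `np.array(5)`) has 0 dimensions.
- `shape`: the actual dimensions of the ndarray, stored as a tuple within the ndarray object. The order of the dimensions is extremely important: they follow the array nesting, from the outermost list to the innermost.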

__A note on the syntax of shape:__ when printing a 1-dimensional shape (or when interacting with numpy using other libraries such as tensorflow), you may see output like this: (3,). The weird part is the comma after the 3, but it is required: this is how python writes a 1-item tuple, since `(3)` without the comma is just the integer 3 in parentheses.
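A quick sketch of what the trailing comma in a 1-dimensional shape means in plain python. The array contents are arbitrary example values.

```python
import numpy as np

one_d = np.array([1, 2, 3], dtype=np.int16)

print(one_d.shape)         # (3,)
print(type(one_d.shape))   # <class 'tuple'>

# the comma is what makes a tuple; parentheses alone do not
print(type((3,)))          # <class 'tuple'>
print(type((3)))           # <class 'int'>
```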

In [None]:
x = np.array([1, 2, 3], dtype=np.int16)
printline(1)
print(x.shape)

x = np.array(
    [[1.5, 2.3, 3.4],
     [4.5, 5.6, 6.7]], dtype=np.float64)
printline(2)
print(x.shape)

x = np.array(
    [[[x ** y for x in range(1, 3)] for y in range(1, 3)],
     [[x ** y for x in range(1, 3)] for y in range(3, 5)],
     [[x ** y for x in range(1, 3)] for y in range(5, 7)]], dtype=np.double)
printline(3)
print(x.shape)

In [None]:
# quiz: what's the shape of the following arrays?
x = np.array(
    [[1, 2],
     [1, 2],
     [1, 2]])

y = np.array([[[1], [1], [1], [1]],
              [[1], [1], [1], [1]],
              [[1], [1], [1], [1]],
              [[1], [1], [1], [1]]])

z = np.array(
    [[[[1, 2, 3], [4, 5, 6]],
      [[7, 8, 9], [10, 11, 12]],
      [[1, 2, 3], [4, 5, 6]],
      [[7, 8, 9], [10, 11, 12]],
      [[1, 2, 3], [4, 5, 6]],
      [[7, 8, 9], [10, 11, 12]]]])

print(x.shape, y.shape, z.shape)

### Reshape operation

Reshape allows you to change the shape of any array. You have to be a bit careful when using it though- the new shape could rearrange elements in potentially unintended ways!

There is only 1 rule to remember when using reshape: __the product of the new dimensions must be exactly the total number of array items.__ Here items refers to the total number of non-array objects in the array (its `size`).

### Special case- reshaping with -1

It is possible to pass -1 for a single reshaping dimension. This is treated as an unknown dimension that numpy infers for you. A little less work to do :)

This can look like an exception to the rule above, but it is not: numpy picks the missing dimension so that the outputted dimensions of a reshape operation with -1 still follow this rule, and it raises an error if no whole number fits!
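A minimal sketch of how `-1` is inferred, using an arbitrary 12-item array. The last call is expected to fail because 12 items cannot be split into 5 equal groups.

```python
import numpy as np

x = np.arange(12)            # 12 items in total

a = x.reshape(3, -1)         # numpy infers 4, since 3 * 4 == 12
print(a.shape)               # (3, 4)

b = x.reshape(2, -1, 3)      # numpy infers 2, since 2 * 2 * 3 == 12
print(b.shape)               # (2, 2, 3)

try:
    x.reshape(5, -1)         # 12 is not divisible by 5
except ValueError as e:
    print(type(e).__name__, e)
```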

In [None]:
x = np.array(
    [[1, 2],
     [1, 2],
     [1, 2]])

# total number of elements remains equal
printline(1)
print(x.reshape(2, 3))
print(x.reshape(1, 6))
print(x.reshape(6, 1))

z = np.array(
    [[[[1, 2, 3], [4, 5, 6]],
      [[7, 8, 9], [10, 11, 12]],
      [[1, 2, 3], [4, 5, 6]],
      [[7, 8, 9], [10, 11, 12]],
      [[1, 2, 3], [4, 5, 6]],
      [[7, 8, 9], [10, 11, 12]]]])

printline(2)
# dimensionality reduction
print(z.reshape(6, 6))
# dimensionality increase
print(z.reshape(2, 3, 2, 3, 1))
# using -1
print(z.reshape(3, -1, 6))
print(z.reshape(3, -1, 6).shape)

### Various operations

Numpy provides various operations which can be applied to arrays:

* `filter (via condition in the [] operator)`
* `order`
* `sum`
* `max`
* `mask (via a conditional expression)`
* `transpose`
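* `sort (note: the ndarray method sorts in-place)`
* `any`
* `all`
* `concatenate`

Most of these are methods of the `np.ndarray` object. Filter and mask use indexing and comparison operators instead, and `concatenate` is the module-level function `np.concatenate`.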

See the documentation to see which are part of the array structure and which are not: https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html
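The difference between a method of the array and a module-level function can be seen with sorting. This is a small sketch with arbitrary values: `np.sort` returns a sorted copy, while the `sort` method changes the array itself.

```python
import numpy as np

x = np.array([3, 1, 2], dtype=np.int64)

sorted_copy = np.sort(x)   # function: returns a sorted copy
print(x)                   # [3 1 2]
print(sorted_copy)         # [1 2 3]

result = x.sort()          # method: sorts x in place and returns None
print(result)              # None
print(x)                   # [1 2 3]

# concatenate only exists as a function, not as a method
print(hasattr(x, "concatenate"))   # False
```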

### Overloading default operations

Ndarrays can also use the standard arithmetic operators, such as `+`, `-`, etc. These have mathematical behaviour though: `+` takes the element-wise sum of the arrays (it does not join them the way it joins python lists), and `-` and such act accordingly.

### Behaviour with dtypes

The above operations can work on arrays with different dtypes provided they are compatible. The dtype of the result is chosen automatically so that it can hold all items of the result (e.g., `int32` combined with `float64` gives `float64`).
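A short sketch of how the result dtype is chosen. The arrays are arbitrary examples; `np.result_type` reports the dtype numpy would pick for a combination of types.

```python
import numpy as np

ints = np.array([1, 2, 3], dtype=np.int32)
floats = np.array([0.5, 1.5, 2.5], dtype=np.float64)

# mixing integers and floats gives a float result
print((ints + floats).dtype)                # float64

# true division of integers also gives floats
print((ints / ints).dtype)                  # float64

# two integer types are promoted to the larger one
print(np.result_type(np.int32, np.int64))   # int64
```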

In [None]:
z = np.array(
    [[[[1, 2, 3], [4, 5, 6]],
      [[7, 8, 9], [10, 11, 12]],
      [[13, 14, 15], [16, 17, 18]],
      [[19, 20, 21], [22, 23, 24]],
      [[25, 26, 27], [28, 29, 30]],
      [[31, 32, 33], [34, 35, 36]]]], dtype=np.int64)
print(z.shape)

# filter
printline(1)
print(z[z % 6 == 0])

# sum
printline(2)
print(z.sum())

# max
printline(3)
print(z.max())

# mask
printline(4)
print(z % 3 == 0)

# transpose
printline(5)
print(z.transpose())
print(z.reshape(6, 6).transpose())
# notice how the shape changes for transpose- the order of the axes is reversed
print(z.transpose().shape)

# sort
printline(6)
z.sort()
print(z)

# any/all
printline(7)
print((z % 3 == 0).any())
print((z % 3 == 0).all())

# concat
printline(8)
a = np.array(
    [[[[1, 2, 3], [4, 5, 6]],
      [[7, 8, 9], [10, 11, 12]],
      [[13, 14, 15], [16, 17, 18]],
      [[19, 20, 21], [22, 23, 24]],
      [[25, 26, 27], [28, 29, 30]],
      [[31, 32, 33], [34, 35, 36]]]], dtype=np.int64)

# notes
# - arrays must be passed together as one sequence (e.g., a tuple)
# - arrays must have the same shape, except along the axis being joined
#   (axis 0 by default)
print(np.concatenate((z, a)))

# infix operations
printline(9)
print(z + a)
print(z - a)
print(z * a)
print(z / a)

### Array axes

Many array operations can be controlled through the array's axes. Recall the output of an example shape command:

```
>>> z = np.array(
    [[[[1, 2, 3], [4, 5, 6]],
      [[7, 8, 9], [10, 11, 12]],
      [[1, 2, 3], [4, 5, 6]],
      [[7, 8, 9], [10, 11, 12]],
      [[1, 2, 3], [4, 5, 6]],
      [[7, 8, 9], [10, 11, 12]]]])

>>> print(z.shape) 
(1, 6, 2, 3)
```

Here, we see the array has 4 dimensions. In Numpy, these dimensions (axes) are 0-indexed: the size of axis 0 is 1, the size of axis 1 is 6, and so on. Numpy operations can be applied along certain axes or over all axes, using the ```axis``` keyword parameter.

### Operation behaviour with axes

Numpy operations can be applied to some or all axes. This may cause the operation to flatten or alter the shape of the array. Additionally, you may get an array back from an operation that you might expect to give you a single number: reducing along one axis removes only that axis from the shape.
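A minimal sketch of how the `axis` parameter changes the shape of a result, using an array with the same shape as the example above but arbitrary contents. `keepdims=True` is an optional argument that keeps the reduced axis with size 1.

```python
import numpy as np

z = np.arange(36, dtype=np.int64).reshape(1, 6, 2, 3)

# no axis: everything is reduced to a single number
print(z.sum())                              # 630

# one axis: only that axis disappears from the shape
print(z.sum(axis=1).shape)                  # (1, 2, 3)

# several axes at once
print(z.sum(axis=(2, 3)).shape)             # (1, 6)

# keepdims keeps the reduced axis with size 1
print(z.sum(axis=1, keepdims=True).shape)   # (1, 1, 2, 3)
```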

Some operations may broadcast their operands against each other, giving a return array with a completely new shape. This includes adding arrays with `+`. This only applies when the shapes are compatible though; otherwise you get an error.

The rule for computing the new shape is as follows: 

if we have 2 arrays `a` and `b`, where `a` has shape `(a1, a2, ..., an)` and `b` has shape `(b1, b2, ..., bn)`, and the result of `a [] b` is `c` with shape `(c1, c2, ..., cn)` (if one array has fewer dimensions, its shape is first padded with 1s on the left):

for each `i` in `1` to `n`:

- if `ai` == `bi`, then `ci` = `ai` = `bi`.
- if `ai` == 1 != `bi` then `ci` = `bi`.
- if `bi` == 1 != `ai` then `ci` = `ai`.
- if `ai` != `bi` and neither is 1 then `error`.
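The rule can be written out as a short function. `broadcast_shape` below is a hypothetical helper written for this tutorial, not part of numpy; it also pads the shorter shape with 1s on the left, which is how numpy treats arrays with fewer dimensions. The result is compared with numpy's own `np.broadcast_shapes`, which assumes NumPy 1.20 or newer.

```python
import numpy as np

def broadcast_shape(a_shape, b_shape):
    """Hypothetical helper: compute the broadcast shape of two shapes."""
    n = max(len(a_shape), len(b_shape))
    # pad the shorter shape with 1s on the left so both have n dimensions
    a_shape = (1,) * (n - len(a_shape)) + tuple(a_shape)
    b_shape = (1,) * (n - len(b_shape)) + tuple(b_shape)
    result = []
    for ai, bi in zip(a_shape, b_shape):
        if ai == bi or bi == 1:
            result.append(ai)
        elif ai == 1:
            result.append(bi)
        else:
            raise ValueError(f"incompatible dimensions {ai} and {bi}")
    return tuple(result)

z_shape, a_shape = (1, 6, 2, 3), (2, 1, 2, 3)
print(broadcast_shape(z_shape, a_shape))                # (2, 6, 2, 3)
print(np.broadcast_shapes(z_shape, a_shape))            # (2, 6, 2, 3)
print((np.zeros(z_shape) + np.zeros(a_shape)).shape)    # (2, 6, 2, 3)
```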

In [None]:
z = np.array(
    [[[[19, 20, 21], [22, 23, 24]],
      [[25, 26, 27], [28, 29, 30]],
      [[13, 14, 15], [16, 17, 18]],
      [[31, 32, 33], [34, 35, 36]],
      [[1, 2, 3], [4, 5, 6]],
      [[7, 8, 9], [10, 11, 12]]]], dtype=np.int64)
print(z.shape)

# filter
printline(1)
print(z[z % 6 == 0])

# sum
printline(2)
print(z.sum())
print(z.sum(axis=1))

# max
printline(3)
print(z.max())
print(z.max(axis=2))

# mask
printline(4)
print((z % 3 == 0).any(axis=1))

# sort
printline(6)
z.sort()
print(z)
z.sort(axis=1)
print(z)
z.sort(axis=2)
print(z)

# any/all
printline(7)
print((z % 3 == 0).any(axis=3))
print((z % 3 == 0).all(axis=2))

# second array, with a different shape that is still compatible for broadcasting
printline(8)
a = np.array(
    [[[[7, 8, 9], [10, 11, 12]]],
     [[[1, 2, 3], [4, 5, 6]]]], dtype=np.int64)
print(a.shape)

# infix operations
printline(9)
print(z + a)
print(z - a)
print(z * a)
print(z / a)

print((z + a).shape)
print((z - a).shape)
print((z * a).shape)
print((z / a).shape)

### Generative functions

Numpy also offers some operations for generation of arrays:

* `linspace`
* `arange`
* `default_rng` (random number generator)

These arrays can be reshaped as desired to become useful, and manipulated exactly the same as regular arrays.

In [None]:
printline(1)
print(np.arange(32))

printline(2)
print(np.linspace(5, 50, 24, dtype=int))

printline(3)
from numpy.random import default_rng

rng = default_rng()
values = rng.standard_normal(10000)

print(values[:100])
print(values[:100].reshape(10, 10))

# Pandas

### Introduction

Pandas also provides you with a single core data structure- the dataframe. If the `np.array` is analogous to a python list, then `pd.DataFrame` is (sort of) analogous to a python dictionary. 

Pandas is also built for scalability, but specifically for 2-dimensional data. You can think of a pandas dataframe like an SQL table, as they come complete with index numbers and column names!

### Basic constructs

There is only 1 we care about: the dataframe, i.e. `pd.DataFrame`. Dataframes act like SQL tables: their individual rows are identified by indices, and their columns by column names. Records can be individually picked out or filtered by row and column.

### Copy attribute

The `copy` parameter of dataframe creation is very important: it specifies whether the dataframe is created on a copy of the source item or on a reference to it. In pandas 2.x the default is `None`, which behaves like `True` for dictionary input and like `False` for a 2-dimensional numpy array or another dataframe. If the data is not copied, then changes to the source item can be propagated to the dataframe, and changes to the dataframe can be propagated to the source object. *Note: this mainly matters when using a numpy array as input, and the exact sharing behaviour depends on your pandas version.*

In [None]:
# dataframe creation- several ways of creating the same dataframe

# note: the items being lists is important
a = {"a": [1], "b": [2], "c": [3]}
x = pd.DataFrame(a)

a2 = (["a", "b", "c"], [[1, 2, 3]])
y = pd.DataFrame(a2[1], columns=a2[0])

# pandas numpy integration
a3 = np.array([[1, 2, 3]], dtype=np.int32)
z = pd.DataFrame(a3, columns=["a", "b", "c"], index=[0])

printline(1)
print(x)
printline(2)
print(y)
printline(3)
print(z)

### Dataframe creation and shape/axis

Dataframe creation is 1 crucial application of shape: pandas will form the dataframe based on the shape *implied* by your combination of data, columns, and indices. This is not always the same as the actual shape of the data you pass in.

Take the second dataframe above. If we used `[1, 2, 3]` instead of `[[1, 2, 3]]`, we actually would have gotten an error. This is because pandas builds 2d tables: the shape of `[1, 2, 3]` is `(3,)`, which is 1d, so pandas treats it as a single column. The shape *implied* by entering `[1, 2, 3]` as the data is `(3, 1)` (3 rows and 1 column) since 3 items are provided on the 0th axis, and that does not match the 3 column names we passed. The shape implied by entering `[[1, 2, 3]]` as the data is `(1, 3)` since we have 3 items on the 1st axis.

The implied shape is `(size of axis 0, size of axis 1)` of your data. If you specify the columns and/or index parameters for list or array data, their lengths must match it, i.e. the shape must equal `(length of index, length of columns)`.

You may need to pad arrays accordingly before turning them into dataframes.
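A small sketch of the implied shape, using the same values as the dataframes above. The failing call passes 1-dimensional data together with 3 column names; the last call reshapes the data to one row first.

```python
import numpy as np
import pandas as pd

# 1d data is treated as a single column
print(pd.DataFrame([1, 2, 3]).shape)      # (3, 1)

# nested data is treated as rows
print(pd.DataFrame([[1, 2, 3]]).shape)    # (1, 3)

# implied shape (3, 1) does not match 3 column names
try:
    pd.DataFrame([1, 2, 3], columns=["a", "b", "c"])
except ValueError as e:
    print(type(e).__name__, e)

# reshaping the data to 1 row and 3 columns fixes the mismatch
row = np.array([1, 2, 3]).reshape(1, -1)
fixed = pd.DataFrame(row, columns=["a", "b", "c"])
print(fixed.shape)                        # (1, 3)
```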

### Parts of a dataframe

There are several key parts of a dataframe:

- `index`: row identifiers
- `columns`: column identifiers
- `ndim`: number of dataframe dimensions (should always be 2)
- `size`: number of entries
- `shape`: shape of the data
- `values`: the actual values in the dataset
- `axes`: analogous to numpy axes

These variables are all gettable, and some are also settable. When getting them, however, they may be wrapped in pandas objects. Only index and columns can be set directly. Row and column values can be edited, but values itself cannot be set.

In [None]:
# show gettable and settable

a = np.array([[1, 2, 3], [6, 5, 4], [7, 8, 9], [12, 11, 10]], dtype=np.int32)
z = pd.DataFrame(a, columns=["a", "b", "c"], index=["a", "b", "c", "d"])

print(z)
print(z.index)
print(z.columns)
print(z.values)
print(z.ndim)
print(z.size)
print(z.shape)
print(z.axes)

# notable types:
print(type(z.index))
print(type(z.columns))
print(type(z.values))

z.index = ["z", "y", "x", "w"]
z.columns = ["pandas", "is", "cool"]

print(z)

### Getting/Setting Values

Getting and setting the actual values can be done in several ways. The most common is to simply index using the [] brackets. This allows us to do simple lookups, such as a column or a list of columns by name, or rows by a slice or a boolean condition.

For more complex lookups, such as several rows and a single column, we need to use the special indexers loc and iloc.
- `loc`: gets the value based on the index label and column name provided.
- `iloc`: gets the value based on the integer positions of the row and column provided (0-indexed)

Both of these can be used to get entire rows and columns as well. They are also compatible with python list syntax, such as slicing.

### Series object

Dataframe rows and columns are not returned as ndarrays: rather, they come back as `pd.Series` objects. These internally contain arrays holding the items, but the series wrapper allows for some more pandas-specific functionality. For instance, pandas series can have an index. Some parameters used when creating a pandas series:

- `index`: same as DataFrame index
- `data`: values (available afterwards through `values` or `to_numpy()`)
- `dtype`: analogous to numpy dtype
    - `dtype` can have a special value called object- this refers to any non-standard type, e.g. a user-defined class.
- `copy`: same as DataFrame copy
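A minimal sketch of a series on its own and of series coming out of a dataframe. The labels and values are arbitrary example data.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.array([10, 20, 30], dtype=np.int32), index=["x", "y", "z"])
print(s.index.tolist())      # ['x', 'y', 'z']
print(s.dtype)               # int32
print(s["y"])                # 20
print(type(s.to_numpy()))    # <class 'numpy.ndarray'>

# a single column or a single row of a dataframe is a series
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["r0", "r1"])
print(type(df["a"]))
print(type(df.loc["r0"]))
```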

In [None]:
a = np.array([[1, 2, 3], [6, 5, 4], [7, 8, 9], [12, 11, 10]], dtype=np.int32)
z = pd.DataFrame(a, columns=["a", "b", "c"], index=["a", "b", "c", "d"])
print(z)

# location is presented as (index(es), column(s))

printline(1)
z.loc["a", "a"] = 100
print(z)

z.iloc[0, 1] = 200
print(z)

print(z["a"])
print(z[["b"]])
print(z[["b", "c"]])
try:
    print(z["a", ["b", "c"]])
except Exception as e:
    print(type(e), e)

printline(2)
# get row
print(z.loc["a"])
print(type(z.loc["a"]))
# get column
print(z.loc[:, "a"])
print(type(z.loc[:, "a"]))

printline(3)
# notice the subtle change in syntax here- it makes a major difference!
# get rows
print(z.loc[["a", "b"]])
print(type(z.loc[["a", "b"]]))
# get columns
print(z.loc[:, ["a", "b"]])
print(type(z.loc[:, ["a", "b"]]))
# get specific columns and rows
print(z.loc[["c"], ["a", "b"]])
print(type(z.loc[["c"], ["a", "b"]]))
# set specific columns and rows
z.loc[["c"], ["a", "b"]] = [300, 400]
print(z)
z.iloc[[3], [0, 2]] = [500, 600]
print(z)

### What can you put in a dataframe?

Just about anything (including other dataframes)!

Values can be inserted individually, so in this sense, a dataframe behaves like a dictionary. For example, you can say `z.loc["d", "e"] = 4` even if `z` has no row `"d"` or column `"e"`: the missing row or column is created and its other cells are filled with NaN. Values can also be inserted as rows or columns.

### A note on append

Pandas was recently updated to version 2.0. In 1.x, there was a method called append which made it very easy to add rows to your dataframe. It was deprecated in 1.4 and removed in 2.0, so this method *no longer works*. Many stackoverflow threads will tell you to use it. **You cannot.** Try the methods below instead!
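One replacement for the old `append` is `pd.concat`, sketched here with arbitrary example data. The new row is wrapped in its own one-row dataframe and then joined onto the existing one.

```python
import pandas as pd

z = pd.DataFrame({"a": [1, 6], "b": [2, 5]}, index=["r0", "r1"])

# wrap the new row in a one-row dataframe with matching column names
new_row = pd.DataFrame({"a": [7], "b": [8]}, index=["r2"])

# concat returns a new dataframe; the originals are left unchanged
z = pd.concat([z, new_row])
print(z)
```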

### A note on series

Let's say we try to insert a series into a dataframe. In this case, *only* the items whose index label matches a column name (when inserting a row) or an index label (when inserting a column) will be added to the dataframe. The series does *not* have to be the size of a row/column of the dataframe. If the series index is not specified, it defaults to 0, 1, 2, and so on, so unless those happen to be labels in the dataframe, the row/column will be created but none of the items will be added.


In [None]:
a = np.array([[1, 2, 3], [6, 5, 4], [7, 8, 9], [12, 11, 10]], dtype=np.int32)
z = pd.DataFrame(a, columns=["a", "b", "c"], index=["a", "b", "c", "d"])
print(z)

printline(1)
z.loc["d", "e"] = 5
print(z)

z.loc["f", :] = [13, 14, 15, 16]
print(z)

printline(2)
b = pd.Series(np.array([17, 18, 19, 20], dtype=np.int32), index=["w", "x", "y", "z"])
z.loc["g", :] = b
print(z)

b = pd.Series(np.array([17, 18, 19, 20, 21, 22, 23, 24], dtype=np.int32), 
              index=["w", "a", "x", "b", "y", "c", "z", "e"])
z.loc["g", :] = b
print(z)

# the default index (0, 1, 2, 3) matches no column name, so none of these
# values are inserted
b = pd.Series(np.array([25, 26, 27, 28], dtype=np.int32))
z.loc["g", :] = b
print(z)

### Copy semantics

*Recall: copy mainly matters when using a numpy array as input.*

If a dataframe has copy set to `True`, but a series-turned-row of the dataframe has copy set to `False`, which takes precedence?

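The dataframe's copy setting takes precedence: assigning a series to a row copies its values into the dataframe's own storage, so the link between the series and its source array does not extend to the dataframe.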

In [None]:
printline(1)
a = np.array([7, 8, 9])
b = np.array([[1, 2, 3], [4, 5, 6]])

z = pd.DataFrame(b, copy=True)
print(z)

a = pd.Series(a, copy=False)
z.loc[0] = a
print(z)

a[1] = 100
print(z)

printline(2)
a = np.array([7, 8, 9])
b = np.array([[1, 2, 3], [4, 5, 6]])

z = pd.DataFrame(b, copy=False)
print(z)

a = pd.Series(a, copy=True)
z.loc[0] = a
print(z)

b[1][1] = 100
print(z)

### Operations on dataframes

There are many operations that can be applied to a dataframe. By default, they are applied to the whole column or row specified (though you can't specify a row for all of these). Here are some essential operations:

- `astype` [column]: changes the dtype of one or more columns (dtypes are stored per column, not per row)
- `sort_values` [row, column]: sort rows by column(s) specified or columns by row(s) specified.
- `filter` [row, column]: filters rows and columns by value, combining conditions with bitwise syntax, e.g. & | ~. This is done in square brackets []; the `DataFrame.filter` method does exist, but it selects by row/column label rather than by value.
- statistics 
    - `describe`: gives full insights into the data
    - `mean` 
    - `std`
- `iterrows`: allows iteration over rows
- `items`: allows iteration over columns
- `apply`: applies a specified function to all columns or rows, cannot remove/add rows in this way.

### Common error with filter

`ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().`

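This error occurs if you don't put parentheses around every condition within your expression (`&` and `|` bind more tightly than comparisons such as `==`), or if you use `and`/`or` instead of `&`/`|`. A condition is an expression of the form `(dataframe[row/column select] (comparison) (value))`.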
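A short sketch that triggers the error and then fixes it, using arbitrary example data. In this example the ambiguous object is a Series, so the message names Series rather than DataFrame. Without parentheses, python evaluates `1 & z["b"]` first and then tries to turn the result into a single True/False value.

```python
import pandas as pd

z = pd.DataFrame({"a": [1, 7, 6], "b": [2, 8, 5]})

# missing parentheses: parsed as z["a"] > (1 & z["b"]) < 8
try:
    z[z["a"] > 1 & z["b"] < 8]
except ValueError as e:
    print(type(e).__name__, e)

# parentheses around every condition
selected = z[(z["a"] > 1) & (z["b"] < 8)]
print(selected)
```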

In [None]:
a = np.array([[1, 2, 3], [7, 8, 9], [6, 5, 4], [12, 11, 10]], dtype=np.int32)
z = pd.DataFrame(a, columns=["a", "b", "c"], index=["a", "b", "c", "d"])
print(z)

printline(1)
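# assign the whole column: in pandas 2.x, z.loc[:, "a"] = ... writes into the
# existing int32 column and can leave the dtype unchanged
z["a"] = z["a"].astype(np.float64)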
print(z)
print(z.dtypes)

printline(2)
z = z.sort_values("a")
print(z)

printline(3)
print(z[z["a"] == 1])
# when applying filtering over both rows and columns (a boolean dataframe
# mask), values are not removed- non-matching entries become NaN
print(z[(z["a"] != 1) & (z[["b"]] > 5)])
# notice the parentheses around EVERY SINGLE part of the expression!
print(z[((z["b"] >= 8) & (z["b"] <= 12)) | ~(z[["c"]] % 2 == 0)])

printline(4)
# describe() only shows up on its own if it is the last line of a cell
print(z.describe())

printline(5)
for index, row in z.iterrows():
    print(index)
    print(row)

printline(6)
for label, column in z.items():
    print(label)
    print(column)

printline(7)
print(z.apply(lambda x: x * 2))
print(z.apply(np.sum, axis=0))
print(z.apply(np.sum, axis=1))

printline(8)
def test_fn(x):
    print(type(x))
    print(x)
    return x[x >= 5]

print(z.apply(test_fn, axis=0))
print(z.apply(test_fn, axis=1))

### Missing values

Pandas has built-in support for missing values. This is critical when importing datasets that don't contain all their values. Numpy has a special floating-point value, used much like Python's `None`, called `np.nan` (short for Not a Number), which is used widely in Pandas. When printing a dataframe, it shows up as `NaN`. There are several methods which help support this:

- `isna`: determines if an entry is NaN (the numpy counterpart for float arrays is `np.isnan`)
- `fillna`: fills all NaNs with the specified value(s). You can specify a single value, a dictionary, a series, or a dataframe to fill the values. When not specifying a single value, the keys of a dictionary or series are matched against the column names, and a dataframe is matched on both index and columns.
- `dropna`: drops all rows/columns with at least 1 NaN value.

If you initialize a dataframe with indices and columns but no values, all values will be filled with NaN by default.
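A minimal sketch of detecting missing values, with arbitrary example data. It uses the pandas method `isna` and, for comparison, the numpy function `np.isnan`; note that NaN never compares equal to itself, which is why these helpers are needed.

```python
import numpy as np
import pandas as pd

# NaN is a float, and it is not equal to anything, including itself
print(type(np.nan))            # <class 'float'>
print(np.nan == np.nan)        # False

z = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, np.nan]})

# boolean dataframe marking the missing entries
print(z.isna())

# number of missing entries per column
nan_counts = z.isna().sum()
print(nan_counts)

# numpy equivalent on the underlying float array
print(np.isnan(z["a"].to_numpy()))   # [False  True]
```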

In [None]:
z = pd.DataFrame(columns=[1, 2, 3, 4], index=["a", "b", "c", "d"])
print(z)

printline(1)
z.loc["b", :] = [5, 6, 7, 8]
z.loc["c", :] = [5, 6, 7, 8]
z.loc[:, 2] = [5, 6, 7, 8]
z.loc[:, 3] = [5, 6, 7, 8]
print(z)

printline(2)
a = z.copy(deep=True)
a = a.dropna(axis=0)
print(a)
a = z.copy(deep=True)
a = a.dropna(axis=1)
print(a)

printline(3)
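# fill every NaN with a single value (returns a new dataframe)
print(z.fillna(0))
# fill from a series: its index labels are matched against the column names,
# so only column 1 is filled here; column 4 has no matching label
x = pd.Series([1, 2, 3, 4], index=[1, 5, 6, 7])
z = z.fillna(x)
print(z)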

# Example with dataset

### Dataset explanation

Next time! But you can take an initial look at the dataset here:

https://www.kaggle.com/code/headsortails/wiki-traffic-forecast-exploration-wtf-eda

### Workshop itinerary

- dataset exploration and preprocessing with pandas and numpy, dataset plotting and visualization with matplotlib (workshop 2)
- time series forecasting- 3 different methods using scipy, scikit-learn, and tensorflow (workshop 3)