In [None]:
%pip install numpy scipy scikit-learn plotly pandas matplotlib
%pip install --no-cache-dir --force-reinstall https://dm.cs.tu-dortmund.de/nats/nats25_00_03_python_packages-0.1-py3-none-any.whl
import nats25_00_03_python_packages

# Introduction
## Python (Packages)

In this notebook, we will look at the most important packages that you will need from the get-go.
These are [`numpy`](http://www.numpy.org/), [`pandas`](https://pandas.pydata.org/), [`scipy`](https://www.scipy.org/), [`sklearn`](http://scikit-learn.org), and for plotting either [`plotly`](https://plotly.com/python/) or [`matplotlib`](https://matplotlib.org/).

### Import

To use a package in Python code, you can use the `import` keyword followed by the package name.
If you want to import only a part of a package, such as a submodule or a specific function or class, you can use the `from` keyword as shown below.
In case you want to rename something you have imported, append the `as` keyword, followed by an alias.

In [None]:
import pandas
import numpy as np
from plotly import graph_objects as go
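
As a minimal sketch of what each import form actually binds, the following uses the standard library module `math` as a stand-in (chosen only because it needs no installation); the alias names `m` and `circle_constant` are arbitrary and purely illustrative.

```python
import math                             # binds the module under the name `math`
import math as m                        # binds the same module under the alias `m`
from math import sqrt                   # binds only the function `sqrt`
from math import pi as circle_constant  # binds a single name under an alias

print(math.sqrt(16.0))             # 4.0
print(m is math)                   # True: an alias is just another name for the module
print(sqrt is math.sqrt)           # True: the very same function object
print(circle_constant == math.pi)  # True
```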

### numpy

Probably the most important package for computing in Python is numpy.
It extends plain Python with fast vectors and matrices (`numpy.ndarray`, created with `numpy.array`) and gives you access to most mathematical functions that you will ever want to use on vectors and matrices.
In addition, most numpy functions are implemented in compiled C code and merely called from Python, which makes them considerably faster than equivalent pure Python loops.

To create a vector or matrix, you can simply call the `numpy.array` function with a list or tuple as argument.
Lists of lists (Python native matrices) will automatically be converted into a two-dimensional array.

In [None]:
import numpy as np
myMat = [[3*i+j for j in range(3)] for i in range(3)]
print("Python matrix:\n",myMat)
npMat = np.array(myMat)
print("Numpy matrix:\n",npMat)
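
To make the difference to nested lists a bit more tangible, here is a small sketch that inspects the resulting array and applies one arithmetic operation to all elements at once. The variable names are illustrative only.

```python
import numpy as np

nested = [[3 * i + j for j in range(3)] for i in range(3)]
np_mat = np.array(nested)

print(np_mat.shape)  # (3, 3): three rows, three columns
print(np_mat.ndim)   # 2: number of axes
print(np_mat.dtype)  # an integer type; the exact width depends on the platform

# Arithmetic on an array is applied elementwise without an explicit loop.
doubled_np = np_mat * 2
# The plain Python equivalent needs a nested comprehension.
doubled_py = [[2 * value for value in row] for row in nested]
print(doubled_np.tolist() == doubled_py)  # True
```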

To create numpy arrays without creating Python lists in advance, you can call one of the many constructors from the numpy package:

In [None]:
print("Zeros matrix:\n",np.zeros((3,2)))
print("Ones tensor:\n",np.ones((2,2,2)))
print("Diagonal matrix:\n",np.diag([1,2,3]))
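
A few more constructors that tend to be handy, shown as a short sketch (the selection is a suggestion, not an exhaustive list):

```python
import numpy as np

identity = np.eye(3)             # 3x3 identity matrix (floats)
sevens = np.full((2, 3), 7)      # 2x3 array filled with a constant
grid = np.linspace(0.0, 1.0, 5)  # 5 evenly spaced values from 0 to 1, both ends included

print("Identity matrix:\n", identity)
print("Constant matrix:\n", sevens)
print("Evenly spaced values:\n", grid)
```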

Indexing of numpy arrays is similar to lists, but instead of multiple brackets, you can write the indices in one bracket separated by commas.
This notation supports slicing.

In [None]:
# numpy.arange behaves very similarly to range but returns a numpy array
# reshape transforms an array into a different shape; in this case from a flat array of length 25 to 5x5
mySecondMat = np.arange(25).reshape((5,5))
print("Full matrix:\n",mySecondMat)
print("Central element:\n",mySecondMat[2,2])
print("Submatrix:\n",mySecondMat[1:4,1:4])
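
The sketch below shows a few more slicing patterns and one detail that is easy to miss: a basic slice is a view on the original data, so writing to it changes the original array. Use `.copy()` if you need independent data.

```python
import numpy as np

mat = np.arange(25).reshape((5, 5))

print("Second row:", mat[1])                      # a single index selects along the first axis
print("Second column:", mat[:, 1])                # ':' means "everything along this axis"
print("Every other row/column:\n", mat[::2, ::2]) # slices accept a step
print("Last element:", mat[-1, -1])               # negative indices count from the end

sub_copy = mat[1:4, 1:4].copy()  # independent copy
sub_copy[0, 0] = -1
print("After writing to the copy:", mat[1, 1])    # still 6

sub_view = mat[1:4, 1:4]         # view on the same memory
sub_view[0, 0] = -1
print("After writing to the view:", mat[1, 1])    # now -1
```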

You can also index numpy arrays with masks (boolean numpy arrays of the same shape, where `True` means "take the value" and `False` means "discard the value") and with index lists (lists of the specific indices to return):

In [None]:
myArray = np.arange(50)**2
myMask = myArray % 7 == 1 # All squares that leave remainder 1 when divided by 7
mySelection = [4,7,9,2,5]
print("Entire data:", myArray)
print("Selecting with mask:", myArray[myMask])
print("Selecting with index list:", myArray[mySelection])
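
Masks and index lists are closely related. As a small sketch, `numpy.flatnonzero` converts a mask into the positions of its `True` entries, and a mask can also be used on the left-hand side of an assignment:

```python
import numpy as np

squares = np.arange(10) ** 2    # 0, 1, 4, 9, 16, 25, 36, 49, 64, 81
even_mask = squares % 2 == 0

# Positions at which the mask is True: 0, 2, 4, 6, 8
even_positions = np.flatnonzero(even_mask)
print("Positions of even squares:", even_positions)

# Indexing with the mask or with the positions selects the same values.
print(np.array_equal(squares[even_mask], squares[even_positions]))  # True

# Masked assignment: cap every value above 50 at 50.
capped = squares.copy()
capped[capped > 50] = 50
print("Capped values:", capped)
```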

If you want to combine masks, you can do so with the functions `numpy.logical_and`, `numpy.logical_or`, and `numpy.logical_not`:

In [None]:
myArray = np.arange(30)**2
myMaskA = myArray % 7 == 1
myMaskB = myArray % 5 == 1
myMaskCombined = np.logical_and(myMaskA, myMaskB)
print("Selecting with mask A:", myArray[myMaskA])
print("Selecting with mask B:", myArray[myMaskB])
print("Selecting with combined masks:", myArray[myMaskCombined])
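
For boolean arrays, the operators `&`, `|`, and `~` do the same as the three functions above. The sketch below shows the equivalence; note the parentheses around each comparison, which are required because `&` and `|` bind more tightly than `==`.

```python
import numpy as np

squares = np.arange(30) ** 2
mask_a = squares % 7 == 1
mask_b = squares % 5 == 1

both = mask_a & mask_b     # same as np.logical_and(mask_a, mask_b)
either = mask_a | mask_b   # same as np.logical_or(mask_a, mask_b)
neither = ~either          # same as np.logical_not(either)

# Written inline, each comparison needs its own parentheses.
both_inline = (squares % 7 == 1) & (squares % 5 == 1)

print("Both conditions:", squares[both])  # the values 1, 36 and 841
print("Neither condition:", squares[neither])
```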

You can also create random numbers with numpy using the `np.random` subpackage.
Numpy also includes numerous linear algebra functions, such as the dot product, which we will use a lot, and an accessor for the transpose of a matrix.
Here are a few examples, but take a look at the docs yourself.

In [None]:
matA = np.random.sample((3,3))
print("Random matrix:\n",matA)
print("Transposed matrix:\n",matA.T)
matB = np.diag(np.random.randint(0,10,3))
print("Random integer diagonal matrix:\n",matB)
print("Matrix multiplication:\n",matA.dot(matB))
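
If you need reproducible random numbers, one option is numpy's `Generator` interface, created with `np.random.default_rng` and a seed. The sketch below (the seed value 42 is arbitrary) also uses the `@` operator, which for two-dimensional arrays computes the same matrix product as `dot`.

```python
import numpy as np

rng = np.random.default_rng(seed=42)          # seeded generator for reproducible results
mat_a = rng.random((3, 3))                    # uniform floats in [0, 1)
mat_b = np.diag(rng.integers(0, 10, size=3))  # random integers from 0 to 9 on the diagonal

product = mat_a @ mat_b
print(np.allclose(product, mat_a.dot(mat_b)))     # True: @ and dot agree for 2D arrays
print(np.allclose(product.T, mat_b.T @ mat_a.T))  # True: (AB)^T equals B^T A^T

# A second generator with the same seed reproduces the same numbers.
rng_again = np.random.default_rng(seed=42)
print(np.array_equal(mat_a, rng_again.random((3, 3))))  # True
```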

Additionally, numpy includes numerous mathematical functions that can generally be applied directly to arrays as componentwise operations.

In [None]:
arr = np.random.sample(5)
print("Random values:",arr)
print("Arc cosines (radians):",np.arccos(arr))
print("Arc cosines (degrees):",np.arccos(arr) / np.pi * 180)
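
With fixed instead of random inputs, the componentwise behavior is easier to check by hand. This sketch also uses `np.degrees`, which performs the same radians-to-degrees conversion as the manual formula above.

```python
import numpy as np

cosines = np.array([1.0, 0.5, 0.0, -1.0])
angles_rad = np.arccos(cosines)       # applied to every element separately
angles_deg = np.degrees(angles_rad)   # same as angles_rad / np.pi * 180

print("Angles (radians):", angles_rad)
print("Angles (degrees):", angles_deg)  # approximately 0, 60, 90 and 180
```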

### pandas

Pandas is a common package for handling datasets.
The most common use case is a table called `pandas.DataFrame` with named columns and rows.
In this course, you will have to deal with CSV files every now and then, and pandas is a convenient way to read those.

In [None]:
import pandas as pd
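import urllib.request  # the submodule must be imported explicitly; a bare "import urllib" does not guarantee it is loaded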
file_path, _ = urllib.request.urlretrieve("https://dm.cs.tu-dortmund.de/nats/data/pokemon.csv")
df = pd.read_csv(file_path)
# If the last line of a cell is an expression, its value is displayed below the cell.
df
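
If you want to experiment without downloading anything, `read_csv` also accepts file-like objects. The sketch below reads a tiny, made-up table from a string; the column names mimic the ones used in this notebook, but the rows are invented placeholders and not taken from the actual dataset.

```python
import io

import pandas as pd

# Made-up placeholder data, NOT rows from the real CSV file.
csv_text = """Name,HP,Attack,Defense
Alpha,45,49,49
Beta,60,62,63
Gamma,80,82,83
"""

toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df.shape)    # (3, 4): three rows, four columns
print(toy_df.dtypes)   # one data type per column
print(toy_df.head(2))  # the first two rows
```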

Pandas provides numerous functions to get some information on the dataset at hand, such as the `describe` function.

In [None]:
df.describe()

In contrast to numpy arrays, DataFrames offer multiple ways to address data.
Plain indexing with square brackets accepts either a single column name, which returns that column, or a list of column names, which returns the respective subtable.
To address specific rows, you can choose between the `loc` and `iloc` accessors: `loc` accepts an index label (like a dictionary key), whereas `iloc` accepts an integer position.
With the default integer index these often behave the same, but DataFrames can have an arbitrary index, e.g. email addresses in a user table.
Slicing rows can also be done directly on the DataFrame with square brackets (because pandas is a little strange sometimes).

In [None]:
print("Selecting a single column:\n",df["Name"],"\n")
print("Selecting multiple columns:\n",df[["Name","Attack","Defense"]],"\n")
print("Selecting with index key:\n",df.loc[5],"\n")
print("Selecting with integer index:\n",df.iloc[5],"\n")
print("Selecting a slice directly on the DataFrame:\n",df[4:7],"\n")
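
The difference between `loc` and `iloc` only becomes visible with a non-default index. The following sketch uses a small, invented user table indexed by email address (all names and values are hypothetical). It also shows row filtering with a boolean mask, which works much like the numpy masks above.

```python
import pandas as pd

# Invented example data with email addresses as the index.
users = pd.DataFrame(
    {"name": ["Ada", "Bob", "Cleo", "Dan"], "age": [36, 41, 29, 52]},
    index=["ada@example.org", "bob@example.org", "cleo@example.org", "dan@example.org"],
)

by_label = users.loc["bob@example.org"]  # row selected by index label
by_position = users.iloc[1]              # row selected by integer position
print(by_label.equals(by_position))      # True: both refer to the second row

# Label slices include the end label, position slices exclude the end position.
label_slice = users.loc["ada@example.org":"cleo@example.org"]  # 3 rows
position_slice = users.iloc[0:2]                               # 2 rows

# loc also accepts a row label and a column name together.
cleo_age = users.loc["cleo@example.org", "age"]

# Boolean masks select rows.
older = users[users["age"] > 40]
print(older)
```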

Most of the packages we will be working with understand DataFrames just as well as numpy arrays.
Yet sometimes DataFrames feel a bit clunky if you want to compute something.
In that case you can simply convert the DataFrame to a numpy array with the `to_numpy` method.
However, this operation creates a numpy array with a single common data type that can represent the values of all selected columns (for example, floats for a mix of ints and floats).

In [None]:
df["HP"] = df["HP"].astype(float) # Make one of the columns float
print("to_numpy on floats and ints gives an array of floats:\n",df[["HP","Attack","Defense"]].to_numpy().dtype)
print("to_numpy on ints gives an array of ints:\n",df[["Attack","Defense"]].to_numpy().dtype)
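
The common type matters most once a text column is involved, because the result then falls back to the generic `object` type, which is slow for numerical work. A minimal sketch with an invented three-column table:

```python
import numpy as np
import pandas as pd

# Invented example data with a text, an integer and a float column.
frame = pd.DataFrame({"label": ["a", "b"], "count": [1, 2], "ratio": [0.5, 1.5]})

ints_only = frame[["count"]].to_numpy()
mixed_numeric = frame[["count", "ratio"]].to_numpy()
with_text = frame[["label", "count"]].to_numpy()

print(ints_only.dtype)      # an integer type
print(mixed_numeric.dtype)  # float64: ints are converted to floats
print(with_text.dtype)      # object: no common numeric type exists
```

Selecting only the numeric columns before calling `to_numpy` avoids the `object` fallback.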

### scipy and sklearn

Scipy is a library for scientific computation built on numpy.
It consists of a core library and multiple add-on scipy toolkits (scikits for short) that are installed and imported as separate packages.
Whenever you need some scientific computation that is not provided by numpy, you will most likely find it in scipy.
We won't go into details on scipy here yet and will rather address it when it comes up during an assignment.

Sklearn (short for scikit-learn) implements numerous algorithms from statistics and machine learning.
In Python it is a decent go-to package for reference implementations in that field, even though not all of them are state of the art (anymore).
For this course, however, they provide a very decent baseline for your learning experience.
As with scipy, we won't discuss any details on sklearn yet.

### plotly and matplotlib

In the Python community, matplotlib is the default visualization package.
It can be used to display images, 3D renderings, curve plots, histograms, and anything else that you might wish to render.
Matplotlib's `pyplot` interface is centered on a state machine and at times feels somewhat dated and uncomfortable.
Many packages have been introduced to simplify rendering in Python and reduce the amount of code required for simple plots.

My favorite contender (with very decent documentation) is plotly.
Plotly uses an object-oriented plotting pipeline rather than a state machine and (to my mind at least) has a concept that is overall easier to understand.
The core of plotly is written in JavaScript, which also makes it a lot easier to make the plots interactive, as rendering and event handling take place in the browser.

As we will need to create quite a lot of plots, you should get comfortable with at least one of the two packages.
Matplotlib is, as far as I know, still the more common solution, but if you ask me, you would miss out on something by not using plotly.

Here are two examples (try clicking the legend entries in the plotly plot):

In [None]:
import numpy as np
xs = np.arange(0,2*np.pi,.05)
coss = np.cos(xs)
sins = np.sin(xs)

from plotly import graph_objects as go
go.Figure([
    go.Scatter(x=xs,y=coss,name="cosine"),
    go.Scatter(x=xs,y=sins,name="sine")
], layout=dict(title="Plotly")).show()

from matplotlib import pyplot as plt
fig, ax = plt.subplots()
ax.plot(xs, coss, label='cosine')
ax.plot(xs, sins, label='sine')
ax.legend()
ax.set_title("Matplotlib")
plt.show()
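
To illustrate what "state machine" means for matplotlib, the sketch below draws the same two curves twice: once through `pyplot` functions that act on an implicit "current" figure, and once through explicit figure and axes objects. The `matplotlib.use("Agg")` line is an assumption made only so that the sketch runs without a display; leave it out in a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless backend; omit this line in a notebook

import numpy as np
from matplotlib import pyplot as plt

xs = np.arange(0, 2 * np.pi, 0.05)

# State-machine style: pyplot keeps track of the current figure and axes.
plt.figure()
plt.plot(xs, np.cos(xs), label="cosine")
plt.plot(xs, np.sin(xs), label="sine")
plt.legend()
plt.title("pyplot (implicit state)")
implicit_ax = plt.gca()  # the axes that pyplot was drawing on

# Object-oriented style: every call names the object it acts on.
fig, ax = plt.subplots()
ax.plot(xs, np.cos(xs), label="cosine")
ax.plot(xs, np.sin(xs), label="sine")
ax.legend()
ax.set_title("Axes methods (explicit objects)")
```

Both variants produce the same kind of plot; the explicit style tends to be easier to follow once a figure has several subplots.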

That's it for packages. In the last notebook we will take a look at some advanced Python features.