# Python from Zero: The absolute Beginner's course

## Session 3 / 4 - 22.02.2023 9:00 - 16:00
<br>
<font size="3">
    <i>by Fabian Wilde, Katharina Hoff, Matthis Ebel, Natalia Nenasheva & Mario Stanke<br></i><br>
<b>Contacts:</b> nenashen66@uni-greifswald.de, matthis.ebel@uni-greifswald.de 
<br>
</font>
<br>

## Importing Modules and Namespaces
<br>
<font size="3">
Structuring the code in functions which can be easily reused and maintained is the first step towards cleaner and leaner code. Nevertheless, in big Python projects, the code would be hard to read and collaborating on a common project would be difficult if all the code were in one single file. Python projects are therefore organised in modules, which allows distributing parts of the code over multiple files.<br><br> The figure below shows an exemplary project directory structure organized in modules and submodules:<br><br>
</font>
<div align="center">
    <img src="img/absolute-import.jpg" width="40%">
</div>
<br>
<font size="2"><i>Source: <a href="https://www.geeksforgeeks.org/absolute-and-relative-imports-in-python/">https://www.geeksforgeeks.org/absolute-and-relative-imports-in-python/</a></i></font>
<br><br>
<font size="3">
<b>In order to use an existing (3rd party) module in your new script or project, you need to import the module using</b><br><br>
<b><font face="Courier">import <i>module_name</i></font></b><br><br>
<b>or define a shorter alias for the module name, if it is too long when used in the code:</b><br><br>
<b><font face="Courier">import <i>module_name</i> as <i>alias</i></font></b><br><br>
<font size="3">
<b>You can also import a specific class or function from a module:</b><br><br>
<b><font face="Courier">from <i>module_name</i> import <i>function_name</i> as <i>alias</i></font></b><br><br>
<b>You can also further specify an <i>absolute path</i> to the submodule (or the file) to import a specific class or function from:</b><br><br>
<b><font face="Courier">from <i>module_name</i>.<i>submodule_name</i>.<i>file_name</i> import <i>function_name</i></font><br><br>
</b>
<b>Related to the project structure above, we could use the statement:</b><br><br>
<b><font face="Courier">from pkg2.subpkg1.module5 import fun3</font></b><br><br>
    to specify an <b><i>absolute import</i></b> path. However, absolute import paths can become long and cumbersome when the directory structure is very large.<br><br>
<b>A relative import with respect to the project structure in the figure above may be defined using</b><br><br>
<b><font face="Courier">from .subpkg1.module5 import fun3</font></b><br><br>
    
The best practice in Python regarding imports can be found in the official Python style guide <a href="https://www.python.org/dev/peps/pep-0008/#imports">PEP8</a>.<br> The most important best practices are:
<ul>
    <li>import statements should be located at the beginning of your script.</li>
    <li>import statements should be sorted in alphabetical order for their module names.</li>
    <li>standard library imports before 3rd party imports.</li>
</ul>
</font>
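
As a minimal sketch, the import forms above can be tried with the standard library module `math`, which needs no installation. The aliases `m` and `root` are arbitrary names chosen for this illustration.

```python
# plain import: the content is accessed through the module name
import math

# import with an alias: the same module is now also reachable as "m"
import math as m

# import a single function under an alias
from math import sqrt as root

print(math.pi)
print(m.floor(2.7))   # same function as math.floor
print(root(16))       # same function as math.sqrt
```

All three forms give access to the same module content, they only differ in the name you type in your code.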

### Examples:

In [None]:
# standard library imports first
# use for each import a new line
import math
import os

# the most popular example to use numpy
# imports numpy and sets the namespace to the alias "np"
# the module content is then accessible via the alias
import numpy as np

# the most popular example to plot/visualize data
from matplotlib import pyplot as plt
# also possible: import matplotlib.pyplot as plt

# example of importing a specific class from a module (importing class `tqdm` from the module `tqdm`)
from tqdm import tqdm

<font size="3"><b>You can check the path, version and short documentation of a module using the module attributes:</b></font>

In [None]:
# the hidden attribute __version__ contains the module version
print(np.__version__)
# the hidden attribute __path__ contains the module path or location
print(np.__path__)
# the hidden attribute __doc__ contains a short documentation
# or description of the module, the so-called docstring
print(np.__doc__)

## The most popular Python Libraries (3rd party Modules)

<br>
<font size="3">
The most popular Python libraries which you will use yourself sooner or later are <br>
<ul>
<li><b><a href="https://numpy.org/">numpy</a>:</b><br>Numpy is one of the most widely used Python libraries. It offers fast handling and efficient storage of bigger amounts of numerical data in numpy arrays as well as a variety of useful functions to read-in various data file formats.</li><br>
<li>
<b><a href="https://matplotlib.org/">matplotlib</a>:</b><br>Matplotlib is the most common Python library for data visualization. The library <a href="https://seaborn.pydata.org/">seaborn</a> builds on matplotlib and offers more beautiful plots and more sophisticated plot types for statistics.
</li><br>
    <li>
        <b><a href="https://pandas.pydata.org/">Pandas</a>:</b><br>Pandas is a library for the analysis and handling of large amounts of tabular data, with built-in plotting support. It became most popular for time series analysis and is also commonly used in finance.
    </li><br>
    <li>
     <b><a href="https://www.scipy.org/">scipy</a>:</b><br>Scipy is a general purpose library for science and engineering offering mostly functionalities for signal analysis, filtering and regression.
    </li><br>
        <li>
     <b><a href="https://www.statsmodels.org/">statsmodels</a>:</b><br>As the name implies, the library Statsmodels offers a big variety of statistical models and tests for your data.
    </li><br>
    <li>
     <b><a href="https://scikit-image.org/">scikit-image</a>:</b><br>Scikit-image offers functionalities for automatic image processing, enhancement and segmentation.
    </li><br>
     <li>
     <b><a href="https://scikit-learn.org/">scikit-learn</a>:</b><br>Scikit-learn offers a variety of machine learning models as well as functions for data fitting and model evaluation.
    </li><br>
     <li>
     <b><a href="https://www.tensorflow.org/">tensorflow</a>:</b><br>Tensorflow is the most popular machine learning library developed mostly by Google besides the competing <a href="https://pytorch.org/">pytorch</a> by Facebook.
    </li>
</ul>
</font>

---

## How to list installed modules and install new modules in Python

<br>
<font size="3">
If you're working with an existing Python installation, the most important modules are usually already installed. However, "less popular" modules that you need may be missing, or on a new Python installation, even Numpy etc. are often not yet installed.<br><br>
In most Python environments, a package manager is used to administer the installed modules.<br> 
In most cases, this is either <a href="https://docs.python.org/3/installing/index.html">pip</a> or, if you use the Anaconda environment, it will be <a href="">conda</a>.<br><br>
In the following, the most important commands are listed to check which modules are installed and how to install a new one.<br>
<b>Run these commands in a terminal, or in a special cell of the notebook!</b>
<!--    <b>In order to check which modules are installed, you can use the command:</b><br><br>
    <i>pip list</i>
    <br><br>
    or
    <br><br>
    <i>conda list</i><br><br>
    on the command-line prompt in a console or use directly-->
</font>

pip package manager:
    
    pip -h                    # get help
    pip list                  # list installed packages
    pip install module_name   # install the package module_name
    
conda package manager:
    
    conda -h                    # get help
    conda list                  # list installed packages
    conda install module_name   # install the package module_name
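
From inside Python, you can also check whether a module is available before importing it. The sketch below uses `importlib.util.find_spec` from the standard library; the helper name `is_installed` and the module name `surely_not_installed_xyz` are made up for this illustration.

```python
import importlib.util

def is_installed(module_name):
    """Return True if a top-level module with this name can be imported."""
    return importlib.util.find_spec(module_name) is not None

print(is_installed("json"))                       # part of the standard library
print(is_installed("surely_not_installed_xyz"))   # hypothetical, non-existing module
```

If the check returns `False`, the module has to be installed first with one of the package manager commands listed above.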

In [None]:
# example of a special notebook cell: note the ! at the beginning of the line, this makes Jupyter run the line as in a terminal!
! conda -h

---

### Side Note: Conda Environments
<font size="3">
If you have multiple projects using Python, it can happen that some third party modules are not compatible!<br>
For example, if you need some older version of a module in one project, which can happen when legacy code relies on some old functionalities, it might be that in another project you cannot install another module that relies on a newer version of the other module. This can get messy...<br><br>
    
Luckily, you can have different <i>environments</i>. These behave as if you have different Python installations, and in each environment, you can install the combination of modules that you need, without disturbing the modules of another environment.<br><br>

Here are some commands to use environments in conda:

    conda env -h                           # get help
    conda env list                         # list all existing conda environments
    conda create --name environment_name   # create a new environment with the name environment_name
    conda activate environment_name        # switch to the environment with the name environment_name
    conda info                             # print information about conda and the current environment

In [None]:
# if you don't create and load an environment yourself, conda puts you in the `base` environment:
! conda info

---

<font size="3"><div class="alert alert-warning"><b>Exercise 1:</b><br>Create your own module! Write a function `square()` in the module, that takes a single numeric argument and returns the square of that number. Load the module in this notebook and call the function from here.
</div>
    

#### How-to:
* Create a new Python script in today's directory (`pyzero/Session 3`): In the file tree on the left, navigate to today's directory, right-click with your mouse somewhere in the file tree, and select "New File"

![](img/newfile.png)

* Name the file `my_module.py` (or any name you like)
* Double-click on the file you just created and write your function in that file, don't forget to save afterwards!
* Use the `import` statement with your module filename (omit the `.py` ending, i.e. `import my_module`)  
  **Remark:** This works, because the notebook and the module are in the same directory. Including custom modules from other places might be more complicated!
* Try to access the function in the module and calculate the square of some numbers!

### Try it yourself:

#### Example Solution:
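
The solution cell assumes that a file `my_module.py` exists next to the notebook. Its content could look like the following sketch (one possible implementation; the docstring is optional):

```python
# content of my_module.py

def square(x):
    """Return the square of the number x."""
    return x * x
```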

In [None]:
from my_module import square
print(square(5))

## Interlude: Plain File Loading in Python

Python offers a way to read text files line-by-line. That is great, because now we can write programs that can load (lots of) data from a file. Before, we always had to hard-code the data we wanted to use, which is not very realistic!  

The syntax to do so is

    with open('path/to/file', 'rt') as file_handle:
        # do stuff with file_handle
        
`path/to/file` is the path from the current working directory to the file that you want to open. `rt` stands for "read text": this tells the `open()` function that it should expect a text file and that we want to read from it.  

Python takes care of opening the file and creating a _stream_ that can be used through `file_handle`. The `file_handle` can behave like an iterable, but it also has reading methods:

* iterating over `file_handle` - reads the file line by line

        with open('path/to/file', 'rt') as file_handle:
            for line in file_handle:
                print(line)

* `readlines()` - reads the complete file and returns a list containing all the lines  

        with open('path/to/file', 'rt') as file_handle:
            file_content = file_handle.readlines()
        
        print(file_content)
    


Note: after the indented code below `with open(...) as ...:` has ended, Python automatically closes the connection to the file. That's convenient, otherwise we would have to do this manually!
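
A small sketch to verify this behaviour: the attribute `closed` of the file handle tells whether the file is still open. The temporary file is only created so that the example runs anywhere; it is not part of the course data.

```python
import os
import tempfile

# create a small temporary file to have something to read
with tempfile.NamedTemporaryFile("wt", suffix=".txt", delete=False) as tmp:
    tmp.write("first line\nsecond line\n")
    tmp_path = tmp.name

with open(tmp_path, "rt") as file_handle:
    inside = file_handle.closed    # False: the file is open inside the block
    lines = file_handle.readlines()

outside = file_handle.closed       # True: Python closed the file for us

print(inside, outside)
print(lines)

os.remove(tmp_path)  # clean up the temporary file
```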

In [None]:
# Read the complete file `data/iris.csv` and concatenate all lines into one string
with open("data/iris.csv", "rt") as fh:
    a = ""
    for line in fh.readlines():
        a = a + line

In [None]:
# write the first 2000 characters of the string `a` to a new file
with open("data/iris_out.csv", "wt") as file_handler:
    file_handler.write(a[:2000])

In [None]:
# Read the complete file `data/iris.csv` and print the first 10 lines
with open("data/iris.csv", "rt") as fh:
    content = fh.readlines()
    
content[:10]

**Note:** For many file types like CSV (comma separated values), JSON, etc., there are better ways to read provided by (third-party) modules!
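
As one example of such a module, the sketch below uses `csv`, which is part of the standard library. The CSV content is a made-up two-row table held in a string, and `io.StringIO` is used only so that the example runs without a file on disk.

```python
import csv
import io

# made-up example content, standing in for a small CSV file
csv_text = "name,age\nJane,41\nJohn,44\n"

# io.StringIO lets us treat the string like an opened text file
with io.StringIO(csv_text) as file_handle:
    reader = csv.reader(file_handle)
    header = next(reader)      # first row contains the column names
    rows = list(reader)        # remaining rows as lists of strings

print(header)
print(rows)
```

Note that all values are returned as strings, so numbers still have to be converted, e.g. with `int()` or `float()`.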

#### File Writing

Of course, we can also write to files! The approach is very similar:

    with open('path/to/file', 'wt') as file_handle:
        # do stuff with the file_handle, e.g.
        file_handle.write("This line is written to the file!")
        
`wt` stands for "write text" and tells the `open()` function that it should open the file for writing, and that we want to write text to it.  

Useful `file_handle` methods are
* `write(string)` - writes the string to the file
* `writelines([string1, string2, ...])` - writes each element in the list to the file (no newline characters are added automatically)
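
A minimal sketch of both methods, assuming a temporary file path that is created only for this illustration. Neither method adds line breaks, so each string has to end with `"\n"` if it should form its own line.

```python
import os
import tempfile

lines = ["first line\n", "second line\n"]   # note the explicit "\n"

# temporary file path, only used for this illustration
tmp_path = os.path.join(tempfile.mkdtemp(), "example.txt")

with open(tmp_path, "wt") as file_handle:
    file_handle.write("header\n")      # write a single string
    file_handle.writelines(lines)      # write all strings of the list

# read the file again to check what was written
with open(tmp_path, "rt") as file_handle:
    content = file_handle.read()

print(content)
```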

<font size="3"><div class="alert alert-warning"><b>Exercise 2.1:</b><br>
Create your first file with Python! Write the data of the dictionary 'table' from Session 1 into a text file 'table.txt'. In the file, write each element of the dictionary into a separate line, start each line with the key and then add the values separated by commas.<br>       
</div>
The file should look like:<br>
Name: Rosalind,Jane,John,Albert,Marie,<br>
Surname: Franklin,Doe,Doe,Einstein,Curie,<br>
Age: 100,41,44,141,153, 
    
<font size="3"><div class="alert alert-warning"><b>Bonus exercises:</b><br>
    1. Write the dictionary 'table' to the file as in Exercise 2.1, but this time use the comma only as a separator and make sure that the lines don't end with a comma. <br>
    2. Write the dictionary 'table' columnwise to a file, such that the output looks like:</div> 
    Name,Surname,Age<br>
    Rosalind,Franklin,100<br>
    Jane,Doe,41<br>
    John,Doe,44<br>
    Albert,Einstein,141<br>
    Marie,Curie,153
<br>       


In [None]:
table = {"Name" : ["Rosalind", "Jane", "John", "Albert", "Marie"], 
         "Surname" : ["Franklin", "Doe", "Doe", "Einstein", "Curie"],
         "Age" : [100, 41, 44, 141, 153]}

### INSERT YOUR CODE BELOW

### Example Solution

In [None]:
table = {"Name" : ["Rosalind", "Jane", "John", "Albert", "Marie"], 
         "Surname" : ["Franklin", "Doe", "Doe", "Einstein", "Curie"],
         "Age" : [100, 41, 44, 141, 153]}

with open('table.txt', 'wt') as file_handler:
    for key in table:
        file_handler.write(key + ": ")
        for value in table[key]:
            file_handler.write(str(value) + ",")
        file_handler.write("\n")
    

#### Bonus exercise 1

In [None]:
table = {"Name" : ["Rosalind", "Jane", "John", "Albert", "Marie"], 
         "Surname" : ["Franklin", "Doe", "Doe", "Einstein", "Curie"],
         "Age" : [100, 41, 44, 141, 153]}

with open('table.txt', 'wt') as file_handler:
    for key in table:
        file_handler.write(key + ": ")
        file_handler.write(','.join(map(str, table[key])))
        file_handler.write("\n")
    

#### Bonus exercise 2

In [None]:
table = {"Name" : ["Rosalind", "Jane", "John", "Albert", "Marie"], 
         "Surname" : ["Franklin", "Doe", "Doe", "Einstein", "Curie"],
         "Age" : [100, 41, 44, 141, 153]}

with open('table.txt', 'wt') as file_handler:
    file_handler.write(','.join(table.keys()) + "\n")
    # iterate over the row indices (all lists in the table have the same length)
    for i in range(len(table["Name"])):
        line = ""
        for key in table:
            line += str(table[key][i]) + ","
        file_handler.write(line[:-1] + "\n")
    

<font size="3"><div class="alert alert-warning"><b>Exercise 2.2:</b><br>
Create a file from Python! First, create a `dict` with some keys (`str`) and some values. The values might be `int`, `float`, `str`, `bool`, `list` or `dict` (please do not use `tuple` and `set`!).<br>
Open a file for text writing, and use the function `dump()` from the module `json` to write your dictionary to that file!
</div>
    
Hints:
* The use of `dump` is: `dump(dictionary_name, file_handle)`
* You don't need to use another `write()` or `writelines()` method!

In [None]:
import json

# YOUR CODE HERE

### Example Solution

In [None]:
import json

my_dict = {
    'key1': 1,
    'key2': [True, False],
    'foo': 'bar',
    'new key': 42
}

with open('testfile.json', 'wt') as file_handle:
    json.dump(my_dict, file_handle)

In [None]:
with open('testfile.json', 'rt') as fh:
    for line in fh:
        print(line)

---

### Interlude: Getting Quick Help in Jupyter Notebooks

Most built-in functions and functions from properly-written modules contain special strings, so-called docstrings, that can be  
turned into a short _documentation_.

Jupyter offers a handy shortcut to view these strings: Write the function that you want help for,  
then press and hold the `Shift` key, then press the `Tab` key.

![](img/shiftTab.png)

![](img/nbhelp.png)
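
The same documentation strings are also available in code, which works outside of Jupyter as well. The function `square` below is a made-up example; `help()` is a built-in Python function.

```python
def square(x):
    """Return the square of the number x."""
    return x * x

# the docstring is stored in the attribute __doc__
print(square.__doc__)

# help() prints a formatted version of the same documentation
help(square)
```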

---

## A short introduction to Numpy

<font size=3>
    In 2020, a <a href="https://www.nature.com/articles/s41586-020-2649-2">paper</a> about the numpy Python package was published in <i>Nature</i>.<br><br>
        <div align="center">
<img src="img/numpy_nature.webp" width="100%">
        </div>
<font size="2"><i>Source: <a href="https://www.nature.com/articles/s41586-020-2649-2">https://www.nature.com/articles/s41586-020-2649-2</a></i></font>
<br><br>
<b>Numpy</b> is a powerful 3rd party package offering the new data type of the <b><i>numpy array</i></b> (numpy.ndarray), backed by a fast implementation in C. In contrast to lists in Python, <b>the size of a numpy array cannot be changed after creation, and the best practice is to allocate space in advance by initializing an array of the final size (e.g. filled with zeros)</b>. Also, all elements share the same data type and the number of elements in each row or column has to be the same, since <b>a two-dimensional numpy array represents an N x M matrix.</b><br><br>
<!--<b>To sum it up, a numpy array behaves like a "classic" array in other high-level programming languages due to it's precompiled C++ implementation in the background.</b><br>-->
</font>
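
A short sketch of these two properties, the common data type and the allocation in advance. The variable names are chosen for this illustration only.

```python
import numpy as np

# mixing int and float in the input: numpy converts everything to float
arr = np.array([1, 2.5, 3])
print(arr.dtype)

# allocate the array once with its final size, then fill it
squares = np.zeros(5)
for i in range(5):
    squares[i] = i * i
print(squares)
```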

<font size="3">For using Numpy, you have to import the package once. Since you will in the following often have to refer to numpy, we assign an alias that is faster to type (np):</font>

In [None]:
# import numpy with alias np
import numpy as np

### NumPy Array

<br>
<font size="3">A NumPy Array can have many dimensions. Let's start with one (similar to a Python list) and two dimensions (similar to a matrix):</font>

In [None]:
# create a numpy array from a list
my_list = [1.0,2.35,3.141]
arr1 = np.array(my_list)
print("type(my_list) =", type(my_list))
print("type(arr1) =", type(arr1))
print("arr1 =", arr1)

In [None]:
# create a 2D array / a 3x3 matrix filled with zeros
arr2 = np.zeros((3,3))
print("type(arr2) =", type(arr2))
print("arr2 =", arr2, sep="\n")

In [None]:
# create a numpy array filled with a specific value
arr3 = np.full((3,2), 2)
print("arr3 =", arr3, sep="\n")

In [None]:
# index an array element in a multi-dimensional array
arr2 = np.zeros((3,3))
print("arr2 =", arr2, sep="\n")
print()
arr2[0,0] = 1
arr2[1,1] = 2
arr2[2,2] = 3
print("type(arr2) = "+str(type(arr2)))
print("arr2 = \n"+str(arr2))

In [None]:
# the attribute "size" contains the total number of elements, whereas len() only returns the length of the first dimension
# the attribute "shape" contains a tuple with the array or matrix dimensions
print("arr2.size = "+str(arr2.size))
# convention for the shape tuple: (number of rows, number of columns) in case of a 2D array
print("arr2.shape = "+str(arr2.shape))

#### More Indexing

Indexing and slicing of one-dimensional arrays work just like with Python `list`s.

In [None]:
data = np.array([1, 2, 3])

print("data[1]:  ", data[1])
print("data[0:2]:", data[0:2])
print("data[1:]: ", data[1:])
print("data[-2:]:", data[-2:])

With higher-dimensional arrays, you can put a comma-separated list of indices in the `[ ]` operator. The `:` means that you want a slice of _all_ values in the respective dimension.

In [None]:
arr4 = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12]])
print("arr4\n", arr4, "\n")
print("arr4[2,1]", arr4[2,1], "\n")
print("arr4[0,:]", arr4[0,:], "\n")
print("arr4[0]  ", arr4[0], "\n")
print("arr4[:,0]", arr4[:,0], "\n")
print("arr4[0:2,1:3]\n", arr4[0:2,1:3])

<font size="3"><div class="alert alert-warning"><b>Exercise 3:</b><br>
Practice using Numpy! Create a 3-dimensional Numpy array with zeros of the shape `(3,2,4)`.<br>
Replace the first <i>2 x 4</i> matrix with an array of the same shape (`(2,4)`), containing only twos.<br>
In the second <i>2 x 4</i> matrix, write all numbers from 0 to 7.<br>
In the third <i>2 x 4</i> matrix, store the result of the element-wise multiplication of the first two matrices
</div>

Hints:
* `np.array([1,2,3,4]) * 2` results in the array `[2,4,6,8]`
    
* `np.array([1,2,3,4]) * np.array([1,2,3,4])` performs element-wise multiplication and results in `[1,4,9,16]`
    
The result should look like this:
    
    [[[ 2.  2.  2.  2.]
      [ 2.  2.  2.  2.]]

     [[ 0.  1.  2.  3.]
      [ 4.  5.  6.  7.]]

     [[ 0.  2.  4.  6.]
      [ 8. 10. 12. 14.]]]
    
or this
    
    [[[ 2.  2.  2.  2.]
      [ 2.  2.  2.  2.]]

     [[ 0.  2.  4.  6.]
      [ 1.  3.  5.  7.]]

     [[ 0.  4.  8. 12.]
      [ 2.  6. 10. 14.]]]

In [None]:
# YOUR CODE HERE

### Example Solution

In [None]:
x = np.zeros((3,2,4))

x[0] = np.ones((2,4)) * 2

v = 0
for i in range(2):
    for j in range(4):
        x[1,i,j] = v
        v += 1
        
x[2] = x[0]*x[1]
print(x)

<font size="3"><b>Sometimes you want or need to change the shape of your numpy array using the reshape method:</b></font>

In [None]:
# defines a numpy array with row number unequal column number
arr3 = np.array([[2,3],[3,5],[6,4]])
print("content of arr3:")
print(arr3)

In [None]:
# prints the shape (number of rows, number of columns)
print("shape of arr3:")
print(arr3.shape)

In [None]:
# transposes the array / matrix
arr3 = arr3.T
print("transposed array (rows and columns interchanged):")
print(arr3)

In [None]:
# prints the shape of the transposed array
print("shape of the transposed arr3:")
print(arr3.shape)

In [None]:
# changing the shape can be achieved using the reshape method, which accepts the new shape
# as a tuple or as separate arguments (the total number of elements needs to stay the same!)
arr3 = np.array([[2,3],[3,5],[6,4]]) # reset array to original
arr3 = arr3.reshape(2,3)
print(arr3) # note that the order of elements might be different from arr3.T

<font size="3"><b>Often you would like to perform computations over an entire row or column of your array (hence your dataset) for some statistical evaluation, like:
    </b></font>
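
Because the random numbers used in the following cells are hard to check by hand, here is a tiny sketch with a fixed array that shows what the `axis` argument does. The array `small` is made up for this illustration.

```python
import numpy as np

small = np.array([[1, 2, 3],
                  [4, 5, 6]])

col_sum = small.sum(axis=0)   # collapses the rows: one value per column
row_sum = small.sum(axis=1)   # collapses the columns: one value per row

print(col_sum)   # [5 7 9]
print(row_sum)   # [ 6 15]
```

A way to remember it: the axis you pass is the dimension that disappears from the shape of the result.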

In [None]:
# generates a 2D matrix of random floating-point numbers
# np.random.normal generates normally distributed random numbers and expects (mean, standard deviation, size=(a,b))
arr = np.random.normal(0,2,size=(6,8))
print("random number array:")
print(arr)

In [None]:
# sums all values in each column
col_sum = np.sum(arr, axis=0)
print("col_sum = "+str(col_sum))

In [None]:
# sums all values in each row
row_sum = np.sum(arr, axis=1)
print("row_sum = "+str(row_sum))

In [None]:
# multiplies all values in each column
col_prod = np.prod(arr, axis=0)
print("col_prod = "+str(col_prod))

In [None]:
# multiplies all values in each row
row_prod = np.prod(arr, axis=1)
print("row_prod = "+str(row_prod))

In [None]:
# calculate mean
total_mean = np.mean(arr)
print("Overall mean:"+str(total_mean))
# calculate standard deviation
total_std = np.std(arr)
print("Overall standard deviation:"+str(total_std))

In [None]:
# calculate mean over all elements in a column (iterates over first array dimension)
col_mean = np.mean(arr, axis=0)
print(col_mean)

In [None]:
# calculate mean over all elements in a row (iterates over second array dimension)
row_mean = np.mean(arr, axis=1)
print(row_mean)

#### Numpy Performance

Other than being very convenient to use, these Numpy functions also speed up your code!  

Take for example the sum of all matrix elements. One could compute this manually by accessing all the elements and summing them up on the way.

As we will see, numpy functions are much faster!

In [None]:
import time # load the module time, contains functions to measure time

big_array = np.random.normal(0,2,size=(1000,8000)) # large array with 8 million elements

# manual sum calculation
start_timepoint = time.time() # store the current time
manual_sum = 0
for i in range(0, 1000):
    for j in range(0, 8000):
        manual_sum += big_array[i,j]

duration = time.time() - start_timepoint # calculate how many seconds have passed since `start_timepoint` and now
print(manual_sum, "\nManual calculation took", duration, "s")
print()

# numpy sum
start_timepoint = time.time() # store the current time
np_sum = big_array.sum()
duration = time.time() - start_timepoint # calculate how many seconds have passed since `start_timepoint` and now
print(np_sum, "\nNumpy calculation took", duration, "s")

<font size="3"><b>Of course, you can also do elementwise manipulations and matrix multiplications (that was one of the original purposes of numpy):</b></font>

In [None]:
arr = np.array([1,2,3,4])
print("arr:")
print(arr)

In [None]:
# e.g. multiply each element with a value
print("result of arr * 3.141:")
print(arr * 3.141)

In [None]:
# compute the products of two arrays element-wise
arr2 = np.array([5,6,0,1])
print("arr2:")
print(arr2)
result = arr * arr2
print("result of arr * arr2:")
print(result)

In [None]:
# compute the dot or scalar product of two vectors
result = np.dot(arr, arr2)
print("result of arr (dot) arr2:")
print(result)

In [None]:
# more general, compute the product of two matrices
arr = np.array([[1,2],[5,6],[4,2]])
arr2 = np.array([[5,6,7],[2,3,2]])
print("arr:")
print(arr)
print("arr2:")
print(arr2)
result = np.matmul(arr, arr2)
print("result of matmul(arr,arr2):")
print(result)

### Loading Files with Numpy

<font size="3">Often, you don't want to or can't define a list or numpy array with your data manually. Instead, you have a file, e.g. a CSV file with comma-separated values, whose content needs to be converted to a table-like data structure. In this example, we use the famous <a href="https://en.wikipedia.org/wiki/Iris_flower_data_set">Iris dataset</a> (a very popular standard data set in data science, containing data of samples from Iris flowers of three related species). The file looks like this:</font>

![](img/iris.png)  

</font>
<font size="3">
Of course, we could now manually read in the file line by line as shown above, but then we would need to process the strings and convert them to a table-like data structure. <b>Luckily, there are already functions in the 3rd party libraries <a href="https://numpy.org/"><i>numpy</i></a> and <a href="https://pandas.pydata.org/"><i>pandas</i></a> that solve this problem.</b><br>
We can read in the file using <a href="https://numpy.org/doc/stable/reference/generated/numpy.genfromtxt.html">numpy.genfromtxt</a>:</font>

In [None]:
# print the documentation string of the function
print(np.genfromtxt.__doc__)

In [None]:
# have a look at the raw file
with open("data/iris.csv", "rt") as file_handle:
    lines = file_handle.readlines()
    
lines[:10]

In [None]:
# load data file and convert it to (named) numpy array
iris_data = np.genfromtxt("data/iris.csv", names=True, delimiter=",")
# output just the first 20 rows
print("type(iris_data) = "+str(type(iris_data)))
print("iris_data[:20] =\n"+str(iris_data[:20]))
# print the guessed data types for the columns of the table
print("iris_data.dtype = \n"+str(iris_data.dtype)) # \n is the newline character: it does not get printed as shown here, but print() begins a new line at this point!

<font size="3"><b>Obviously, the numpy function couldn't convert the strings in the last column to numbers (just returning <i>NaN</i> (not a number)).</b><br> 
For tables with mixed data (i.e. numbers and strings) there are better suited classes, e.g. the `DataFrame` from the `pandas` module, which we will see later. <!-- But we can correct that:</b>--></font>
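
Staying with numpy for a moment, one possible way around the `NaN` column is to let `genfromtxt` determine the data type of each column separately with `dtype=None`. This is a sketch under assumptions: the three rows and the column names below are typed in by hand in the style of the Iris file, and `io.StringIO` stands in for the real file.

```python
import io
import numpy as np

# three rows in the style of the Iris file; the column names are assumed
csv_text = (
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "5.1,3.5,1.4,0.2,setosa\n"
    "4.9,3.0,1.4,0.2,setosa\n"
    "7.0,3.2,4.7,1.4,versicolor\n"
)

# dtype=None: determine the data type of each column individually
data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", names=True,
                     dtype=None, encoding="utf-8")

print(data.dtype)
print(data["species"])
```

The result is a named (structured) array in which the numeric columns are floats and the last column contains strings.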

<!--USED TO BE A CODE CELL
# load data file again with defined data types and column names and convert it to (named) numpy array
iris_data = np.genfromtxt("data/iris.csv", names=True, delimiter=",", \
                          dtype=[('sepal_length', '<f8'), ('sepal_width', '<f8'),\
                                 ('petal_length', '<f8'), ('petal_width', '<f8'),\
                                 ('species', '<S8')])
print("iris_data[:20] = "+str(iris_data[:20]))
#print the guess data types for the columns of the table
print("iris_data.dtype = \n"+str(iris_data.dtype))

# access column data with column name
print("iris_data['sepal_length'] = \n"+str(iris_data['sepal_length'][:20]))

#work with the data
#compute mean of column
mean = np.mean(iris_data['sepal_length'])
#compute standard deviation of column
std = np.std(iris_data['sepal_length'], axis=0)
print("Mean of column sepal_length:" + str(mean))
print("Std of column sepal_length:" + str(std))

# round to n significant digits with np.round
print("Mean of column sepal_length:" + str(np.round(mean,3)))

# or just change the number output format
-->

<font size="3"><div class="alert alert-warning"><b>Exercise 4:</b> Write a function which loads a file (<b>in this case applied to <i>data/glass.csv</i></b>), prints the shape of the resulting array, prints the first N rows (adjustable via a function argument), computes and returns the means and standard deviations for the data columns and optionally prints the results (adjustable via a parameter).<br><br>
    <b>Hint:</b> <b>Use the numpy functions <a href="https://numpy.org/doc/stable/reference/generated/numpy.mean.html">np.mean</a> and <a href="https://numpy.org/doc/stable/reference/generated/numpy.std.html">np.std</a></b> to compute the mean and the standard deviation. The parameter axis allows you to specify for which dimension/axis you'd like to perform the computation, so you don't need to use a loop.<br><br>
If you use the keyword argument <b>names = True</b> in numpy.genfromtxt, a named array will be returned where column data can be addressed via the column name defined in the first row of the file.
</div>
    
<b>Try it yourself here:</b></font>

### Example Solution:

In [None]:
import numpy as np  

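def process_file(file, n_rows = 10, print_result = True):
    # load the file, skipping the header row with the column names
    glass_data = np.genfromtxt(file, delimiter=",", skip_header=1)
    # print the shape (number of rows, number of columns) as required by the exercise
    print(glass_data.shape)
    # print the first n_rows rows
    print(glass_data[:n_rows,:])
    # compute the column-wise means and standard deviations
    means = np.mean(glass_data, axis = 0)
    stds = np.std(glass_data, axis = 0)
    if print_result:
        print("means:"+str(means))
        print("stds:"+str(stds))

    return means, stds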

means, stds = process_file("data/glass.csv", n_rows = 10, print_result = True)

## Directory Batch Processing
<br>
<font size="3">
Very often, you not only need to load a single file, but a whole set of files located in one or more directories. Luckily, Python also offers builtin solutions for this problem. A directory can be traversed (also recursively) to yield a list of files to be processed in your Python script.<br><br>
    <b>The simplest solution is offered by the function <i>os.walk</i> from the standard library module <i>os</i>.</b>
</font>
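
To see what `os.walk` returns without depending on the course data, the following sketch first builds a small temporary directory tree and then traverses it. The file and directory names (`a.txt`, `sub`, `b.txt`) are made up for this illustration.

```python
import os
import tempfile

# build a small temporary directory tree (names are made up)
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "sub"))
for rel_path in ["a.txt", os.path.join("sub", "b.txt")]:
    with open(os.path.join(root, rel_path), "wt") as fh:
        fh.write("example\n")

# os.walk yields one tuple per directory: its path, subdirectories and files
found = []
for dirpath, dirnames, filenames in os.walk(root):
    for filename in filenames:
        # store the path relative to the starting directory
        found.append(os.path.relpath(os.path.join(dirpath, filename), root))

print(sorted(found))
```

The file in the subdirectory is found as well, because `os.walk` descends into all subdirectories automatically.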

### Examples:

In [None]:
# imports the required module
import os

def walk_through_files(path, file_extension='.txt'):
    for (dirpath, dirnames, filenames) in os.walk(path):
        for filename in filenames:
            if filename.endswith(file_extension): 
                # yield keyword instead of return defines a generator instead of a regular function
                yield os.path.join(dirpath, filename)

# that's why we iterate over walk_through_files instead of calling it
for fname in walk_through_files("data/batch/"):
    print(fname)
# collect all results of the generator in a list
print(list(walk_through_files("data/batch/")))

In [None]:
print(np.loadtxt.__doc__)

<font size="3"><div class="alert alert-warning"><b>Exercise 5:</b> Write a function to traverse the path data/batch/, read the found files and concatenate their content to one string separated by spaces.
</div>
    
Hints: 
* concatenate strings in a list `ls` using a string `s` as separator with `s.join(ls)`, e.g.  


    print(" ".join(['create', 'a', 'single', 'sentence']))
    
    create a single sentence


* Remove trailing spaces and newline characters from a string with the string method `rstrip()`
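
A short sketch combining both hints; the example strings are made up for this illustration.

```python
lines = ["first line \n", "second line\n", "third line"]

# rstrip() only removes whitespace at the end of each string
cleaned = [line.rstrip() for line in lines]

print(cleaned)
print(" ".join(cleaned))
```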
    
<b>Try it yourself here:</b></font>

### Example Solution:

In [None]:
# iterate over files
buffer = []
for fname in walk_through_files("data/batch/"):
    # open file
    with open(fname, 'rt') as fh:
        # read file content and put it in the buffer list
        buffer.extend(fh.readlines())
    
# remove newlines
for i in range(len(buffer)):
    buffer[i] = buffer[i].rstrip()

print("Result:", " ".join(buffer))

In [None]:
print(str.join.__doc__)