## Data Mining laboratory - Introduction

Welcome to the data mining class. During our meetings, we will be dealing with processing and exploring data with the use of the Python language in the Jupyter Notebook setting. We are also going to use low-code and no-code solutions to the presented problems. Today, we are going to set up our working stations and get familiar with the setup.


#### Course assignments

This course consists of X notebooks, X homework tasks and 3 assignments. In order to get a pass mark, you need to complete all homework tasks. You can get a maximum of 4 points for each assignment. Once an assignment is announced, you have two weeks to complete it. Each week of delay deducts 1 point from the mark you get. The number of points you gather during the course determines your final grade.

| Points | Grade |
| ------ | ----- |
| 0      | 2.0   |
| 6      | 3.0   |
| 7.5    | 3.5   |
| 9      | 4.0   |
| 10.5   | 4.5   |
| 11.5   | 5.0   |
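
The grading rules above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and two points are assumptions rather than course rules, namely that each value in the Points column is the minimum needed for the grade next to it, and that the late penalty cannot push an assignment below 0 points.

```python
from bisect import bisect_right

# Thresholds copied from the table above. Treating each one as the minimum
# number of points required for its grade is an assumption of this sketch.
THRESHOLDS = [0, 6, 7.5, 9, 10.5, 11.5]
GRADES = [2.0, 3.0, 3.5, 4.0, 4.5, 5.0]


def assignment_points(raw_points: float, weeks_late: int = 0) -> float:
    """Deduct 1 point per week of delay (floored at 0, which is an assumption)."""
    return max(0.0, raw_points - weeks_late)


def final_grade(total_points: float) -> float:
    """Return the grade of the highest threshold reached by total_points."""
    index = bisect_right(THRESHOLDS, total_points) - 1
    return GRADES[max(index, 0)]


# Three assignments: 4 points on time, 3.5 points one week late, 3 points on time.
total = (
    assignment_points(4)
    + assignment_points(3.5, weeks_late=1)
    + assignment_points(3)
)
print(total, final_grade(total))  # 9.5 4.0
```
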


### What is the Jupyter Notebook?

It's a computing platform that is very commonly used for presenting code, running it on the spot, and preparing code snippets that might later be used in a larger library. In this setting you can easily combine Markdown text and executable Python code. This format is very popular in machine learning, data mining, and the artificial intelligence field in general. A single file in this setting is very often referred to as just a _Notebook_. The file you are viewing right now is a Notebook. Notebook files usually have the extension _.ipynb_, which stems from the original open-source project name _IPython Notebook_. The Notebook uses an interactive kernel, which keeps the state of the execution between cells. During the execution, all variables, functions, classes, etc. are stored in memory, which gives us flexible access to everything we have coded (this is nothing new compared to a standard interactive Python interpreter). Notebooks are delivered to us in several different settings. Here are some:

- **Advanced, modern IDE that supports Jupyter Notebooks.** In this setting, the IDE is responsible for setting up the interactive kernel with the use of the Python interpreter. A good example of an IDE that supports Jupyter Notebooks is Visual Studio Code. Prior to using this option, we need to set up the Python interpreter on the machine.

  **Pros**:

       * full customization
       * full access to data on hand
       * usually supports version control
       * easy setup process

  **Cons**:

       * you need to set up an IDE on every machine you work on
       * requires installation of Python interpreter on the machine

- **A stand-alone Jupyter Notebook server.** This is the original method of delivering Notebooks. In order to use this setting, one must download and run the Jupyter server as a separate process on the machine at hand. The Jupyter Notebook server often comes bundled with complete Python distributions (e.g. WinPython); in that case, the server executable is usually located within the Python folder. The server lets us access, view and run notebooks via a web application accessible through a browser. It also lets us configure the connection details (e.g. the IP address, port, authentication method, password). If you want to use the server in a public network, you need to be very careful, as the server effectively offers Remote Code Execution, which is a substantial vulnerability when exposed to the wrong people. Whoever has access to the _Notebooks_ via the server essentially has the same privileges as the user who started the server. Nothing stops us from using the server on the _localhost_. Running the server with the default settings is as simple as running the command:

               jupyter notebook

  Once the server is running, you have access to files and directories, starting with the directory in which the server was started. Opening a notebook file switches the application view, so that you can execute the code and read the Markdown.

  **Pros**:

       * full customization
       * access to data on the server machine
       * ability to use it in a network setting with many users and a single server

  **Cons**:

       * fairly hard setup process (if you want to use it with several users in a network setting)
       * if you do not have a server machine, you can only run it in an offline setting
       * no native support for version control
       * requires installation of Python interpreter on the machine

- **External Notebook server paired up with a virtual machine.** In this setting, we are using a virtual machine with a temporary Python environment as the working space. Although we are not forced to maintain the Notebook server, this option comes with several limitations. We are forced to follow the rules of the virtual machine provider. Usually we are not permitted to use such a notebook to host data, download torrents, run an SSH server, connect to a remote proxy, etc. (that is, for anything not really related to data mining). Such a notebook does not have direct access to our files, so we usually need to upload the data to the virtual machine (or a cloud drive) in order to process it. Other than that, we can use the Notebook files as normal. A good example of this setting is Google Colaboratory.

  **Pros**:

       * access to notebooks on any machine with no setup
       * limited customization
       * ability to modify and create new Notebooks on hand on any machine
       * no need to install any software on the machine (except for a browser)

  **Cons**:

       * no support for version control
       * restrictions of use
       * requires an account (e.g. Google Account)
       * limited access to data on hand
       * requires uploading the data to an external server (usually limited space)
       * limited customization


In this course I propose one of the following two options. They are not obligatory; you can use any setup you want:

- Visual Studio Code
- Google Colaboratory


#### Setting up Visual Studio Code.

In the class we will be using the Google Colaboratory service. However, if you want to prepare your own setup at home, or on a personal laptop, you can use Visual Studio Code. The setup process consists of 2 (pretty obvious) steps:

- installing the Python interpreter:
  - if you are using a Linux machine, it is very likely you already have the Python interpreter installed. If this is not the case, use your default package manager to install Python (e.g. `apt install python3` on Debian-based distributions).
  - if you are using a Windows machine, I suggest using the [WinPython](https://winpython.github.io/) package. It comes with a pre-installed set of libraries.
  - you can also use the [default Python installer](https://www.python.org/downloads/).
- installing Visual Studio Code: VSC is a multi-platform IDE. You can find it [here](https://code.visualstudio.com/).

Once you have everything installed, you need to create a space on the computer for this class (we are going to use toy data sets, so you do not need gigabytes of free space). Start by creating a dedicated directory on your hard drive. Download this notebook (the .ipynb version, not the HTML one) and place it in the newly created directory. Then open the Visual Studio Code application and choose the Open Folder option from the File menu. In the file explorer you should be able to see this notebook (VS Code may first ask you to install its Python and Jupyter extensions). Upon the first execution of a code cell you will need to choose the Python interpreter that you have already installed.


#### Using Google Colaboratory

Using Google Colab is much easier. You just need to download this notebook and log in to your Google account on the [Colaboratory website](https://colab.research.google.com/). From the File menu, use the "Upload notebook" option and choose the downloaded file. That's it.


Once you have everything set up, switch from the HTML version of the notebook to the interactive one (either in Colab or in VSC). Starting next week, you will be downloading and opening the notebook at the beginning of each class.


### How is a Notebook organized?

Each Notebook consists of a list of cells. There are two types of cells:

- **Code cell** - the code cell is filled with the code in the programming language the Notebook is set up for (usually it's Python). You can execute the code and immediately see the result. Everything that _happened_ in the execution is available in the next cell you run. Once the code cell is executed, it is annotated with a number, which refers to the order of execution. The first cell you run will be annotated with number [1], second with number [2], etc. The enumeration helps us to keep up with the current status of execution.
- **Markdown cell** - the markdown cell allows us to insert formatted text into the notebook. The text is formatted with the use of the [Markdown](https://www.markdownguide.org/) language. Markdown is a lightweight markup language, which is used to add simple formatting to plain-text documents. It was created in 2004 by John Gruber. It is one of the most popular markup languages. This is the same language you can use, for example, in the Discord app.

Each of the code cells can be executed at any point. Most IDEs allow us to run all cells at once, restart the interpreter (which clears all variables and definitions), add a new cell, and reorder the existing cells.

#### Exercise 1.

Execute the cells in the following order:

1.  Run cell 2
2.  Run cell 1
3.  Run cell 3
4.  Run cell 2
5.  Run cell 3
6.  Restart the kernel
7.  Run cell 1
8.  Run cell 2
9.  Run cell 3.
10. Run cell 3.

Observe the results and make notes. Can we execute cell 2 right away? Why or why not? How does the annotation change when we run a single cell multiple times? What is the value of the \_ expression? You can restart this exercise by restarting the kernel.


Cell 1


In [1]:
a = 5

Cell 2


In [2]:
b = a + 2
b

7

Cell 3


In [3]:
c = a + _
c

12

#### Exercise 2.

Create a new markdown cell directly below this one and use the Markdown language to answer the questions asked in Exercise 1. Use the following features:

- Level 4 heading
- Bullet list
- Bold text


### Cell magic

In order to use a package in your Python script you need to import it like this:


In [4]:
import numpy as np

But what happens when the package is not installed on the machine? You would probably need to open a terminal, type an appropriate command and download the package. This is even more complicated when you have no direct access to the machine. In this case we can use something called [cell magic](https://ipython.readthedocs.io/en/stable/interactive/magics.html). Usually the _Code cell_ is interpreted as Python code. However, we can add a special directive in its first line to change this behaviour. When we add `%%bash` at the beginning of the cell, the cell body is executed as a bash script (this requires bash to be available on the machine). So, in order to install the numpy package (it should already be installed), you can create a cell similar to this one:


In [5]:
%%bash

pip install numpy

DEPRECATION: Loading egg at c:\users\miknowak\appdata\local\programs\python\python312\lib\site-packages\labelimg-1.8.6-py3.12.egg is deprecated. pip 25.1 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330
DEPRECATION: Loading egg at c:\users\miknowak\appdata\local\programs\python\python312\lib\site-packages\lxml-5.2.2-py3.12-win-amd64.egg is deprecated. pip 25.1 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330





[notice] A new release of pip is available: 24.3.1 -> 25.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip
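
Before installing anything, it can be handy to check from Python whether a package is already importable. The sketch below uses the standard library `importlib.util.find_spec`. The helper name `is_installed` and the deliberately fake package name are made up for this illustration. Inside a notebook, IPython also offers the `%pip install numpy` line magic, which installs into the environment of the running kernel.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the import system can locate the top-level module."""
    return importlib.util.find_spec(module_name) is not None


# "json" ships with Python; the last name is a made-up package for the demo.
for name in ["json", "numpy", "surely_not_a_real_package_123"]:
    status = "available" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```
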


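`%%bash` is not the only magic command out there. Sometimes we will compare the execution times of different code variants. In this case we can use the `%%time` or `%%timeit` magic. `%%time` runs the cell once and reports how long it took, while `%%timeit` runs it repeatedly and reports the mean and standard deviation.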


In [6]:
import numpy as np

In [7]:
%%time
a = np.zeros((10000,10000))

CPU times: total: 0 ns
Wall time: 0 ns


In [8]:
%%time
a = [[0 for _ in range(10000)] for _ in range(10000)]


CPU times: total: 5.22 s
Wall time: 5.64 s


In [9]:
%%timeit
a = np.zeros((1000,1000))

67.6 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [10]:
%%timeit
a = [[0 for _ in range(1000)] for _ in range(1000)]


32.7 ms ± 2.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
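
Outside of a notebook, a similar comparison can be made with the standard library `timeit` module. The sketch below is an approximation of what `%%timeit` does, not its exact procedure. The array size, the number of calls and the function names are arbitrary choices made for this example, and the measured times will differ between machines.

```python
import timeit

import numpy as np

N = 200          # kept small so the sketch finishes quickly
CALLS = 100      # how many times each variant is called per run
RUNS = 5         # how many runs are measured


def numpy_zeros():
    return np.zeros((N, N))


def nested_lists():
    return [[0 for _ in range(N)] for _ in range(N)]


# timeit.repeat returns one total time (in seconds) per run.
numpy_runs = timeit.repeat(numpy_zeros, number=CALLS, repeat=RUNS)
list_runs = timeit.repeat(nested_lists, number=CALLS, repeat=RUNS)

# Report the best run, converted to microseconds per single call.
print(f"np.zeros:     {min(numpy_runs) / CALLS * 1e6:.1f} µs per call")
print(f"nested lists: {min(list_runs) / CALLS * 1e6:.1f} µs per call")
```
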


A full list of cell magics can be found [here](https://ipython.readthedocs.io/en/stable/interactive/magics.html).


### Toy data sets

During the course we will be using different data sets in order to get familiar with data mining techniques. This section illustrates several techniques for loading the data sets.

#### Scikit-learn package

Among various packages, we are going to use the scikit-learn package (sklearn). Today we will get familiar with the toy data sets that the package provides. Historically the package shipped 7 toy data sets; one of them, the Boston housing data set, was deprecated and then removed in scikit-learn 1.2. The remaining ones include:

- Iris data set - The famous Iris database, first used by Sir R.A. Fisher.
- Digits data set - The data set contains images of hand-written digits: 10 classes where each class refers to a digit.
- Wine data set - The data is the results of a chemical analysis of wines grown in the same region in Italy by three different cultivators.

#### Loading the data set

The datasets are loaded into a dictionary-like structure, [sklearn.utils.Bunch](https://scikit-learn.org/stable/modules/generated/sklearn.utils.Bunch.html). We use a set of dedicated _load_ functions to load the data sets.


In [11]:
from sklearn.datasets import load_iris, load_breast_cancer, load_digits, load_diabetes, load_linnerud, load_wine

iris_data_set = load_iris()
breast_cancer_data_set = load_breast_cancer()
digits_data_set = load_digits()
diabetes_data_set = load_diabetes()
linnerud_data_set = load_linnerud()
wine_data_set = load_wine()
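
To see what "dictionary-like" means in practice, here is a minimal sketch. It loads the Iris data set again under the shorter name `iris`, chosen only for this example, and shows that a Bunch supports both dictionary-style and attribute-style access.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# A Bunch is a dict subclass, so the usual dict methods are available.
print(sorted(iris.keys()))

# Key access and attribute access give back the very same object.
print(iris["data"] is iris.data)  # True
print(iris.data.shape)            # (150, 4)
```
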


We can obtain a description of each data set by using the `DESCR` field.


In [12]:
print(iris_data_set.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
            - Iris-Setosa
            - Iris-Versicolour
            - Iris-Virginica

:Summary Statistics:

                Min  Max   Mean    SD   Class Correlation
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fis

Each data set consists of a list of entries. Each entry comprises a set of features. Each feature has a name that describes what was measured. We can obtain the names of the features by using the `feature_names` field.


In [13]:
iris_data_set.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

The data in each of the data sets is organized as a NumPy array (more on that next week), with one row per entry and one column per feature. We can get to it by using the `data` field.


In [14]:
print(type(iris_data_set.data))
iris_data_set.data[:5]

<class 'numpy.ndarray'>


array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])
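
Because the columns of `data` follow the order of `feature_names`, a single feature can be selected by looking up its position in that list. The snippet below is a small sketch of this idea; the variable names are chosen only for this example.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Rows are entries, columns are features.
print(iris.data.shape)  # (150, 4)

# Find the column that holds the petal length and slice it out.
column = iris.feature_names.index("petal length (cm)")
petal_length = iris.data[:, column]
print(petal_length[:5])  # [1.4 1.4 1.3 1.5 1.4]
```
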

Each entry belongs to a certain class. We can obtain the names of the classes with the `target_names` field, and the class index of each entry with the `target` field.


In [15]:
print(iris_data_set.target_names)
print(iris_data_set.target)



['setosa' 'versicolor' 'virginica']
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
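
The numbers in `target` are indices into `target_names`, so the two arrays can be combined to get a readable label for every entry. The following sketch shows one way to do it, using NumPy indexing and `np.unique`; the variable names are chosen only for this example.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Index the array of class names with the array of class indices.
labels = iris.target_names[iris.target]
print(labels[0], labels[-1])  # setosa virginica

# Count how many entries belong to each class.
classes, counts = np.unique(iris.target, return_counts=True)
for class_index, count in zip(classes, counts):
    print(iris.target_names[class_index], count)
```
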


### Homework 1.

Write a function that processes a scikit-learn Bunch object. The function is expected to prepare a description of the data set. The description has the following format:

`Dataset data_set_name.`

`Number of samples: NNN`

`Number of classes: NNN`

`  Number of samples in class target_name1: NNN`

`  Number of samples in class target_name2: NNN`

`  ...`

`Number of features: NNN`

`  Average value of feature feature_name1: NNN`

`  Standard deviation of feature feature_name1: NNN`

`  Average value of feature feature_name2: NNN`

`  Standard deviation of feature feature_name2: NNN`

`  Average value of feature feature_name3: NNN`

`  Standard deviation of feature feature_name3: NNN`

`  ...`


In [16]:
from sklearn.utils import Bunch
import numpy as np

def prepare_dataset_description(data_set: Bunch, data_set_name: str):
    # Extract required data
    data = data_set.data
    target = data_set.target
    feature_names = data_set.feature_names
    num_samples = len(data)
    num_features = data.shape[1]
    
    # Initialize the description
    description = f"Dataset {data_set_name}.\n"
    description += f"Number of samples: {num_samples}\n"
    
    # Handle target names and classes (for classification datasets)
    if 'target_names' in data_set:
        target_names = data_set.target_names
        num_classes = len(target_names)
        
        # For classification, count the number of samples in each class
        if target.ndim == 1:  # Only apply bincount if the target is 1D (classification)
            class_counts = np.bincount(target)
            description += f"Number of classes: {num_classes}\n"
            for i, class_name in enumerate(target_names):
                description += f"  Number of samples in class {class_name}: {class_counts[i]}\n"
        else:
            description += f"Target is multi-output regression with {target.shape[1]} targets.\n"
    else:
        # Handle continuous or multi-output regression targets
        if target.ndim == 2:
            description += f"Target is multi-output regression with {target.shape[1]} targets.\n"
        else:
            description += "Target is continuous (no classes).\n"
    
    # Feature statistics
    description += f"Number of features: {num_features}\n"
    for i, feature_name in enumerate(feature_names):
        avg = np.mean(data[:, i])
        std = np.std(data[:, i])
        description += f"  Average value of feature {feature_name}: {avg:.2f}\n"
        description += f"  Standard deviation of feature {feature_name}: {std:.2f}\n"
    
    return description

# Example usage:
from sklearn.datasets import load_iris, load_breast_cancer, load_digits, load_diabetes, load_linnerud, load_wine

# Load the datasets
iris_data_set = load_iris()
breast_cancer_data_set = load_breast_cancer()
digits_data_set = load_digits()
diabetes_data_set = load_diabetes()
linnerud_data_set = load_linnerud()
wine_data_set = load_wine()

# Example calls
print(prepare_dataset_description(iris_data_set, 'Iris'))
print(prepare_dataset_description(breast_cancer_data_set, 'BC'))
print(prepare_dataset_description(digits_data_set, 'Digits'))
print(prepare_dataset_description(diabetes_data_set, 'Diabetes'))
print(prepare_dataset_description(linnerud_data_set, 'Linnerud'))
print(prepare_dataset_description(wine_data_set, 'Wine'))


Dataset Iris.
Number of samples: 150
Number of classes: 3
  Number of samples in class setosa: 50
  Number of samples in class versicolor: 50
  Number of samples in class virginica: 50
Number of features: 4
  Average value of feature sepal length (cm): 5.84
  Standard deviation of feature sepal length (cm): 0.83
  Average value of feature sepal width (cm): 3.06
  Standard deviation of feature sepal width (cm): 0.43
  Average value of feature petal length (cm): 3.76
  Standard deviation of feature petal length (cm): 1.76
  Average value of feature petal width (cm): 1.20
  Standard deviation of feature petal width (cm): 0.76

Dataset BC.
Number of samples: 569
Number of classes: 2
  Number of samples in class malignant: 212
  Number of samples in class benign: 357
Number of features: 30
  Average value of feature mean radius: 14.13
  Standard deviation of feature mean radius: 3.52
  Average value of feature mean texture: 19.29
  Standard deviation of feature mean texture: 4.30
  Average 