# Data

```{epigraph}
In God we trust, all others bring data.

-- William Edwards Deming 
```

**Data** is a broad term that refers to facts, statistics, or information in a raw, unprocessed, or organized form. Data can take many forms, including numbers, text, images, audio recordings, and more.

## Data processing

The process of preparing raw data for machine learning involves several stages of data processing and manipulation to transform it into a structured and suitable format. The most common stages are:

* data collection;
* data cleaning:
    * handling missing values;
    * removing duplicates;
    * outlier detection;
    * data type conversions;
* data exploration and visualization;
* feature engineering.

The result of these manipulations is what is usually called a **dataset**: a specific collection of data that is organized and structured in a way that makes it suitable for analysis, processing, or machine learning tasks.
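The cleaning steps listed above can be sketched with `pandas` on a tiny table. This is only an illustration: the records, the column names (`age`, `city`), the choice of median imputation, and the plausible age range of 0 to 120 are all assumptions made for this example, not a recommended recipe.

```python
import pandas as pd

# Hypothetical raw records; column names and values are made up for illustration.
raw = pd.DataFrame(
    {
        "age": ["25", "31", "31", None, "29", "240"],
        "city": ["Paris", "Berlin", "Berlin", "Rome", "Oslo", "Madrid"],
    }
)

# Data type conversion: the ages arrived as strings.
df = raw.copy()
df["age"] = pd.to_numeric(df["age"])

# Removing duplicates: drop rows that repeat an earlier row exactly.
df = df.drop_duplicates().reset_index(drop=True)

# Handling missing values: fill with the median (one of many possible strategies).
df["age"] = df["age"].fillna(df["age"].median())

# Outlier detection: flag ages outside an assumed plausible range.
df["age_outlier"] = ~df["age"].between(0, 120)

print(df)
```

Flagging outliers in a separate column, rather than dropping them right away, keeps the decision about what to do with them for the exploration stage.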

## Data types

```{image} https://i.ytimg.com/vi/7bsNWq2A5gI/hqdefault.jpg
:alt: data-types
:class: bg-primary mb-1
:width: 500px
:align: center
```

### Numerical continuous data

Continuous data can take on any real[^real] value within a range and often involves measurements. For instance:

* height
* temperature
* distance
* time

### Numerical discrete data

Discrete data consists of distinct, separate values and often involves counts or categorizations, e.g.

* number of children
* shoe size
* test scores

```{important} 
The distinction between continuous and discrete data is occasionally ambiguous. For example, `age` in years should probably be treated as a discrete variable. However, if we allow fractional ages, e.g. $30.2$ years, it becomes a continuous variable.
```

### Categorical nominal variables

Nominal data consists of categories with no inherent order or ranking. For example:

* colors
* fruits
* gender
* countries

### Categorical ordinal variables

Ordinal data includes categories with a meaningful order or ranking. Examples:

* education level
* customer satisfaction
* movie rating
* top-10 items suggested by a search engine
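As a rough illustration, the four data types can be represented in a `pandas` table as follows. The survey records and column names here are hypothetical; the point is only that nominal and ordinal variables can both be stored as categoricals, with the order stated explicitly in the ordinal case.

```python
import pandas as pd

# Hypothetical survey records, one column per data type discussed above.
df = pd.DataFrame(
    {
        "height_cm": [172.5, 180.1, 165.0],         # numerical continuous
        "n_children": [0, 2, 1],                     # numerical discrete
        "country": ["France", "Japan", "Brazil"],    # categorical nominal
        "satisfaction": ["low", "high", "medium"],   # categorical ordinal
    }
)

# Nominal: categories without an order.
df["country"] = pd.Categorical(df["country"])

# Ordinal: the order of the categories is stated explicitly.
df["satisfaction"] = pd.Categorical(
    df["satisfaction"], categories=["low", "medium", "high"], ordered=True
)

print(df.dtypes)

# Ordered categories support comparisons and min/max; unordered ones do not.
print(df["satisfaction"].max())
print((df["satisfaction"] > "low").tolist())
```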

[^real]: In fact, only a finite set of rational numbers is used in practice. However, the real numbers $\mathbb R$ have become so firmly established as the default number set that it seems unreasonable to try to replace them with something more realistic.


## Examples of datasets

There are several ways to import well-known datasets in Python. For instance, we can use the helpers from the [sklearn.datasets](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets) module.

```python
from sklearn.datasets import load_iris

# Load the Iris dataset and list the names of its target classes
data = load_iris()
data.target_names
```

```
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
```
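To get a feel for what `load_iris` returns, it may help to look at a few more attributes of the result. The sketch below uses the `data`, `target`, `feature_names`, and `target_names` attributes of the returned object; the particular checks are just one possible first look at the dataset.

```python
import numpy as np
from sklearn.datasets import load_iris

data = load_iris()

# Feature matrix: one row per flower, one column per measurement.
print(data.data.shape)
print(data.feature_names)

# Targets are integer codes that index into target_names.
print(np.bincount(data.target))            # number of samples per class
print(data.target_names[data.target[:3]])  # class names of the first three samples
```

All four features are numerical continuous measurements, while the target is a categorical nominal variable encoded with integers.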