In [13]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

plt.style.use("seaborn-v0_8")

### Dataset
- *Attribute/feature/dimension/variable* is a dataset column (*p* or *m* or *d*)
- *Observation/sample* is a dataset row (*n* or *N*)

### Data types
| Data Type  |   | Description | Operators | Examples |
|-------------|---| ----------------|------|------|
| Categorical | ***Nominal*** | Set of labels, all equally important | = ≠  | Zip code, eye color, sex, ... | 
|             | ***Ordinal*** | Set of labels that can be sorted according to some criterion | < > ≤ ≥ | Hardness of minerals, non-numerical quality evaluations (bad, fair, good, excellent) | 
| Numerical   | ***Interval*** | Only differences are meaningful (there is no unique definition of zero) | + − | Calendar dates, temperature (F and C), scores (e.g. intervals of 1 to 10) |
|             | ***Ratio*** | Has a unique definition of zero, so ratios are meaningful too | all | Temperatures (Kelvin), masses, length, counts |



### Data characteristics
| | |
| --- | --- |
| **Asymmetric attribute** | a value is important only if it's present (not null) |
| **Sparsity** | when there are many zeros/nulls |
| **Noise** | modification of the original values, or addition of uninteresting values |
| **Outliers** | points that are very different from most of the others. Can be caused by noise or by rare events. |

### Similarity/Dissimilarity
between two values $p$ and $q$

| Attribute type          | Dissimilarity         | Similarity               |
|------------------------|----------------------|-------------------------|
| Nominal | $d = \begin{cases} 0 & \text{if } p = q \\ 1 & \text{if } p \neq q \end{cases}$ | $s = 1 - d$ |
| Ordinal <br> (values mapped to integers [0, V-1]) | $d = \frac{\|p-q\|}{V-1}$ | $s = 1 -d$ |
| Interval or Ratio      | $d = \|p - q\|$       | $s =\frac{1}{1+d}$ <br> or <br>$s= 1-\frac{d-\min(d)}{\max(d)-\min(d)}$ |
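
The table above can be turned into a few small functions. This is only a sketch: the function names are illustrative, the ordinal scale is borrowed from the data-types table, and for numerical values only the $s = \frac{1}{1+d}$ variant is shown.

```python
def nominal_dissimilarity(p, q):
    """0 if the two labels match, 1 otherwise."""
    return 0 if p == q else 1


def ordinal_dissimilarity(p, q, n_levels):
    """|p - q| / (V - 1), with p and q already mapped to integers in [0, V-1]."""
    return abs(p - q) / (n_levels - 1)


def numeric_dissimilarity(p, q):
    """Absolute difference for interval or ratio values."""
    return abs(p - q)


def similarity_from_distance(d):
    """s = 1 / (1 + d): equals 1 when d = 0 and shrinks towards 0 as d grows."""
    return 1 / (1 + d)


# Ordinal scale mapped to the integers 0..3 (assumed mapping for this example)
levels = ["bad", "fair", "good", "excellent"]
rank = {label: i for i, label in enumerate(levels)}

d_nominal = nominal_dissimilarity("blue", "brown")
d_ordinal = ordinal_dissimilarity(rank["bad"], rank["good"], len(levels))
d_numeric = numeric_dissimilarity(3.0, 7.0)
s_numeric = similarity_from_distance(d_numeric)

print(d_nominal, d_ordinal, d_numeric, s_numeric)
```

For nominal and ordinal attributes the dissimilarity already lies in [0, 1], so the similarity is simply `1 - d`.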


### Data statistics

|          | **Numerical** | **Nominal** | **Ordinal** | 
|----------|-------------|-------------|------------|
| **Mean** |$$\frac{\sum_{i=1}^{n} x_i}{n}$$ Sensitive to outliers: a few extreme values can shift it considerably|no sense|no sense|
| **Median**|Value at position $$\frac{n + 1}{2}$$ of the sorted values (if the number of values is even, it is the average of the two middle values)|no sense|middle value in the ordered sequence|
| **Mode**  |Most occurring value. <br>There can be multiple modes, or none if each value occurs once| == | == |
| **Quantiles** | Points taken at regular intervals of a data distribution.<br> - the 4-quantiles are three data points that split the distribution into four equal parts (*quartiles*) <br>- the 100-quantiles divide into 100 set (*percentiles*)| == | == |
| **Variance and Standard deviation** (square root of Variance)| Measures data dispersion: if it's low, the data are close to the mean value $\bar{y}$; if it's zero, all observations have the same value: $$ \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2 $$ | == | == |
| **Covariance** (between two variables)| It's a measure of the joint variability of two variables. <br>The covariance is *positive* if the larger values of one variable mainly correspond to the larger values of the other variable, and the same holds for the smaller values (that is, the variables tend to show similar behavior). <br>The covariance is *negative* when the larger values of one variable mainly correspond to the smaller values of the other (that is, the variables tend to show opposite behavior). <br>If the covariance is zero, the two variables are uncorrelated: there is no linear relationship between them. This does not imply that they are independent. <br>Its magnitude depends on the units of the variables, so on its own it cannot show the strength of the relationship. $$ \sigma_{x,y} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) $$<br> The **covariance matrix** is a square matrix, symmetric with respect to the main diagonal. Each element is the covariance between two dimensions. On the main diagonal, the covariance is computed between a dimension and itself, which is simply its variance. | == | == |
| **Correlation** (between two variables) | It's a standardized measure of covariance that ranges from -1 to 1. <br>A correlation value of *1* indicates a perfect positive linear relationship between the two variables.<br> A correlation value of *-1* indicates a perfect negative linear relationship. <br> A correlation value of *0* indicates the absence of a linear relationship between the two random variables. <br>It measures both the strength and direction of the linear relationship between two variables. | == | == |
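
The statistics in the table can be checked numerically. The sketch below uses made-up data and the population (divide by $n$) formulas from the table. The correlation is computed as the covariance divided by the product of the two standard deviations (the usual Pearson definition, which the table does not spell out). The helper names are illustrative.

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_linear = 3.0 * x + 1.0   # perfectly linear in x
y_square = x ** 2          # fully determined by x, but not linearly

mean_x = x.mean()
median_x = np.median(x)
var_x = x.var()            # population variance: divides by n (ddof=0)


def covariance(a, b):
    """Population covariance, (1/n) * sum((a_i - mean(a)) * (b_i - mean(b)))."""
    return np.mean((a - a.mean()) * (b - b.mean()))


def correlation(a, b):
    """Covariance rescaled by both standard deviations, so it lies in [-1, 1]."""
    return covariance(a, b) / (a.std() * b.std())


cov_linear = covariance(x, y_linear)
corr_linear = correlation(x, y_linear)
cov_square = covariance(x, y_square)

# A single outlier moves the mean much more than the median
with_outlier = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
mean_out = with_outlier.mean()
median_out = np.median(with_outlier)

print(var_x, cov_linear, corr_linear, cov_square, mean_out, median_out)
```

Here `y_square` depends entirely on `x`, yet its covariance with `x` is zero: zero covariance only rules out a linear relationship.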

---
---
# Data Type Conversions

In [14]:
df = pd.DataFrame({"League": ["NBA", "NBA", "Eurocup"], "Stage": ["Regular_Season", "Playoffs", "International"]})
display(df)

Unnamed: 0,League,Stage
0,NBA,Regular_Season
1,NBA,Playoffs
2,Eurocup,International


---
### Nominal to Numerical
&rarr; **One Hot Encoding**
- Assume that a variable can take k different values. The variable is replaced with k columns, each representing one possible value. For each row, the cell in a given column is "1" if the original variable takes that column's value, and "0" otherwise

In [15]:
pd.get_dummies(df)  # the values are True/False

Unnamed: 0,League_Eurocup,League_NBA,Stage_International,Stage_Playoffs,Stage_Regular_Season
0,False,True,False,False,True
1,False,True,False,True,False
2,True,False,True,False,False


In [16]:
OneHotEncoder(sparse_output=False).fit_transform(df)  # the values are binary (0/1)

array([[0., 1., 0., 0., 1.],
       [0., 1., 0., 1., 0.],
       [1., 0., 1., 0., 0.]])

In [17]:
# General solution

# Apply OneHotEncoder to columns 0,1. "remainder=passthrough": the other columns are left the same
ColumnTransformer([("encoder", OneHotEncoder(), [0, 1])], remainder="passthrough").fit_transform(df)

array([[0., 1., 0., 0., 1.],
       [0., 1., 0., 1., 0.],
       [1., 0., 1., 0., 0.]])

---
### Ordinal to Numerical
&rarr; **Integer Encoding**
- The ordered sequence is transformed into consecutive integers

In [18]:
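# Target encoding (one column). Labels are numbered in sorted (alphabetical) order:
# International -> 0, Playoffs -> 1, Regular_Season -> 2
LabelEncoder().fit_transform(df.Stage)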

array([2, 1, 0])

In [19]:
# Attributes encoding (multiple columns)
OrdinalEncoder().fit_transform(df)

array([[1., 2.],
       [1., 1.],
       [0., 0.]])
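
By default both encoders above number the labels in sorted (alphabetical) order, which rarely matches the real ranking of an ordinal attribute. A possible way to impose the intended order is the `categories` parameter of `OrdinalEncoder`. The sketch below assumes the quality scale from the data-types table, and the `Quality` column is made up for the example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Intended ranking, from worst to best (assumed for this example)
quality_order = ["bad", "fair", "good", "excellent"]
ratings = pd.DataFrame({"Quality": ["good", "bad", "excellent", "fair"]})

# One list of categories per column, in the order the integers should follow
ordered_codes = OrdinalEncoder(categories=[quality_order]).fit_transform(ratings)

# Default behaviour for comparison: categories are sorted alphabetically
default_codes = OrdinalEncoder().fit_transform(ratings)

print(ordered_codes.ravel())
print(default_codes.ravel())
```

With the explicit order, "excellent" gets the largest code; with the default it lands right after "bad".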

---
### Numerical to Binary

In [20]:
from sklearn.preprocessing import Binarizer

X = pd.DataFrame([[-10, 10, 200], [3, 1, 0]])
# Values strictly greater than the threshold become 1, all the others 0
Binarizer(threshold=1).fit_transform(X)

array([[0, 1, 1],
       [1, 0, 0]])

In [21]:
# Same idea with plain pandas: a boolean mask with threshold 0
X > 0

Unnamed: 0,0,1,2
0,False,True,True
1,True,True,False


---
### Numerical Discretization

In [22]:
from sklearn.preprocessing import KBinsDiscretizer

X = pd.DataFrame([[-10, -10, 50], [20, 20, -50]])
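# Each column is split into 2 bins (default strategy: "quantile"),
# then each bin is one-hot encoded: 3 columns x 2 bins = 6 output columns
disc = KBinsDiscretizer(n_bins=2, encode="onehot-dense", subsample=None)
disc.fit_transform(X)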

array([[1., 0., 1., 0., 0., 1.],
       [0., 1., 0., 1., 1., 0.]])

---
---
- For *incomplete data* (lacking feature values):
    - Ignore the sample if it contains several attributes with missing values
    - Manually fill in the missing values (time-consuming)
    - Use a global constant value to fill in the missing values
        - Replace all missing attribute values with the same constant (e.g. -1). The ML algorithm may (mistakenly) interpret this as a pattern (several samples with a value in common)
    - Use a measure like the mean (or median) to fill in the missing values:
        - for normal (symmetric) data distributions => mean or median (they coincide)
        - for skewed data distributions => median
        - The feature mean/median can also be computed only over the samples belonging to the same class as the sample with the missing value

- For *noisy data* (containing errors, or values that deviate from the expected):
    - Smooth a sorted data value by looking at the values around it (its neighborhood) => **Binning**
        - The sorted values are distributed into bins (buckets) of equal width. For example, the integers 0 to 5 with three bins: [0,1] [2,3] [4,5]
        - Smoothing by bin means: each value in a bin is replaced by the mean value of the bin
        - Smoothing by bin medians: each value in a bin is replaced by the bin median
        - Smoothing by bin boundaries: the minimum and maximum values in a given bin are identified as the bin boundaries. Each value in the bin is replaced by the closest boundary value

- For *inconsistent data* (feature duplication, sample duplication, human error in data entry, deliberate errors, inconsistent data representations, inconsistent use of codes, errors from the devices that record data, system errors): ...
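
Two of the strategies listed above, class-wise median imputation and smoothing by equal-width bins, can be sketched with pandas and NumPy. The data, the column names and the helper `smooth_by_bins` are made up for illustration; they are assumptions, not part of any library.

```python
import numpy as np
import pandas as pd

# --- Incomplete data: fill each gap with the median of the same class ---
data = pd.DataFrame({
    "label": ["a", "a", "a", "b", "b", "b"],
    "value": [1.0, np.nan, 3.0, 10.0, 30.0, np.nan],
})
class_median = data.groupby("label")["value"].transform("median")  # NaN is skipped
data["value_filled"] = data["value"].fillna(class_median)


# --- Noisy data: smoothing with equal-width bins ---
def smooth_by_bins(values, n_bins, method="mean"):
    """Replace each value by its bin mean, bin median or closest bin boundary."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # Interior edges only, so the maximum value falls into the last bin
    bin_index = np.digitize(values, edges[1:-1])
    smoothed = values.copy()
    for b in range(n_bins):
        in_bin = bin_index == b
        if not in_bin.any():
            continue  # empty bin: nothing to smooth
        members = values[in_bin]
        if method == "mean":
            smoothed[in_bin] = members.mean()
        elif method == "median":
            smoothed[in_bin] = np.median(members)
        elif method == "boundaries":
            low, high = members.min(), members.max()
            # Ties go to the lower boundary
            smoothed[in_bin] = np.where(members - low <= high - members, low, high)
        else:
            raise ValueError(f"unknown method: {method}")
    return smoothed


noisy = [0, 2, 7, 12, 14, 19, 21, 27, 30]   # bins: [0,10) [10,20) [20,30]
by_mean = smooth_by_bins(noisy, n_bins=3, method="mean")
by_median = smooth_by_bins(noisy, n_bins=3, method="median")
by_boundaries = smooth_by_bins(noisy, n_bins=3, method="boundaries")

print(data)
print(by_mean, by_median, by_boundaries, sep="\n")
```

The median is used for the imputation because it is the safer choice when a class distribution is skewed; swapping `"median"` for `"mean"` in `transform` gives the mean-based variant.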