# Table of Contents 

Major Tasks in data preprocessing:
- Data cleaning
    - **<font color=blue>[Handling missing values (on HYPOTHYROID dataset)](#Data-cleaning:-handling-missing-value)</font>** ✅
    - Smooth noisy data
    - Identify or remove outliers
    - Resolve inconsistencies
    - Encoding categorical features 
- Data integration
- Data reduction
    - Dimensionality reduction
    - Numerosity reduction
        - Parametric methods
        - Non parametric methods
            - Sampling
                - Simple random sampling with replacement
                - Simple random sampling without replacement
- Data transformation and data discretization
    - Normalization
    - Concept Hierarchy Generation
    - **<font color=blue>[Discretization with pandas (on IRIS dataset)](#Discretization-with-pandas)</font>** ✅
        - <font color=blue>[Equal width binning](#Equal-width-discretization)</font> ✅
        - <font color=blue>[Equal frequency binning](#Equal-frequency-discretization)</font> ✅

In [None]:
import os
import pandas as pd
import seaborn as sns
import matplotlib.pylab as plt

# Data cleaning: handling missing value
Main strategies:
- **ignoring the tuple**
    - ⚠️ risk of shrinking the dataset too much, which is problematic for the downstream ML algorithm
- **filling missing value manually**
    - ⚠️ typically infeasible/tedious
- **imputing missing value (filling automatically)**
    - ⚠️ arbitrary

### The hypothyroid dataset

Thyroid disease records supplied by the Garavan Institute and J. Ross Quinlan, New South Wales Institute, Sydney, Australia.

In the original dataset, missing values are marked with the symbol "?".

|column|values|
|---|---|
| age|continuous (int)|
| sex|M, F|
| on thyroxine|f, t|
| query on thyroxine|f, t|
| on antithyroid medication|f, t|
| sick|f, t|
| pregnant|f, t|
| thyroid surgery|f, t|
| I131 treatment|f, t|
| query hypothyroid|f, t|
| query hyperthyroid|f, t|
| lithium|f, t|
| goitre|f, t|
| tumor|f, t|
| hypopituitary|f, t|
| psych|f, t|
| TSH measured|f, t|
| TSH|continuous|
| T3 measured|f, t|
| T3|continuous|
| TT4 measured|f, t|
| TT4|continuous|
| T4U measured|f, t|
| T4U|continuous|
| FTI measured|f, t|
| FTI|continuous|
| TBG measured|f, t|
| TBG|continuous|
| referral source|		WEST, STMW, SVHC, SVI, SVHD, other|
| Class| hypothyroid, primary hypothyroid, compensated hypothyroid, secondary hypothyroid, negative|

Load the hypothyroid dataset    

In [None]:
df = pd.read_csv('dataset/hypothyroid.csv', na_values = "?")
df
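
The `na_values` argument is what turns the "?" placeholders into proper missing values. A minimal sketch of its effect on a tiny in-memory CSV (the three rows below are made up for illustration and are not taken from the hypothyroid file):

```python
import io
import pandas as pd

# Made-up CSV content that uses "?" as the missing-value marker
raw = "age,sex,TSH\n41,F,1.3\n?,M,?\n23,?,4.1\n"

# Without na_values, "?" is kept as text, so numeric columns are not parsed as numbers
as_text = pd.read_csv(io.StringIO(raw))

# With na_values, "?" becomes NaN and numeric columns are parsed as floats
as_nan = pd.read_csv(io.StringIO(raw), na_values="?")

print(as_text.dtypes)
print(as_nan.dtypes)
print(as_nan.isna().sum())
```

Without this argument, `isna()` would report no missing values at all, because "?" would be treated as an ordinary string.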

In [None]:
df.head().T

In [None]:
df.describe().T

In [None]:
df.describe(include = 'object').T

In [None]:
df.isna().sum(axis=0)

Attribute removal:
- "*TBG*" column has only missing values - we can drop it
- "*TBG measured*" has only one unique value - we can drop it	

In [None]:
df = df.drop(['TBG', 'TBG measured'], axis = 1)
df
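
Columns of this kind can also be found programmatically rather than by reading the summary tables. A minimal sketch on a small made-up frame (`toy` and its values are illustrative assumptions, not the hypothyroid data):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data) with one all-missing and one constant column
toy = pd.DataFrame({
    'age': [41, 23, np.nan, 70],
    'TBG': [np.nan, np.nan, np.nan, np.nan],
    'TBG measured': ['f', 'f', 'f', 'f'],
})

# Columns in which every value is missing
all_missing = toy.columns[toy.isna().all()].tolist()

# Columns with exactly one distinct non-missing value
constant = toy.columns[toy.nunique(dropna=True) == 1].tolist()

print(all_missing)  # ['TBG']
print(constant)     # ['TBG measured']
```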

<u>Handling missing values: **ignoring the tuple**</u>


In [None]:
df_notna = df.dropna(how = 'any')
print(df_notna.isna().sum())
df_notna

Notice that with this strategy we lose around 30% of the tuples. The ratio computed below is the fraction of tuples that are kept.

In [None]:
len(df_notna) / len(df)

Variation of the class distribution.

In [None]:
fig, axes = plt.subplots(1, 2, figsize = (10, 5), sharey = True)

# Take the class order from the full dataset and use it in both plots,
# so that a class that disappears after the removal is still shown
class_order = df['Class'].value_counts().index

sns.countplot(x = 'Class', 
              data = df, 
              ax = axes[0], 
              order = class_order)
axes[0].tick_params(axis = 'x', labelrotation = 45)
axes[0].set_title('Before removal of tuples with NaN values')

sns.countplot(x = 'Class', 
              data = df_notna, 
              ax = axes[1], 
              order = class_order)
axes[1].tick_params(axis = 'x', labelrotation = 45)
axes[1].set_title('After removal of tuples with NaN values')

plt.show()

Variation of attribute statistics, grouped by class.

In [None]:
fig, axes = plt.subplots(1, 2, figsize = (10, 5), sharey = True)

# Take the class order from the full dataset and use it in both plots,
# so that a class that disappears after the removal is still shown
class_order = df['Class'].value_counts().index

sns.barplot(y = 'age',
            x = 'Class',
            data = df, 
            ax = axes[0], 
            order = class_order)
axes[0].tick_params(axis = 'x', labelrotation = 90)
axes[0].grid()
axes[0].set_title('Before removal of tuples with NaN values')

sns.barplot(y = 'age', 
            x = 'Class', 
            data = df_notna, 
            ax = axes[1], 
            order = class_order)
axes[1].tick_params(axis = 'x', labelrotation = 90)
axes[1].set_title('After removal of tuples with NaN values')
axes[1].grid()

plt.show()

<u>Handling missing values: **imputing missing value (filling automatically)**</u>

Create a novel ad-hoc category

In [None]:
df.sex.fillna('NS').value_counts()

In [None]:
df.sex.value_counts()

Categorical attribute: assign the most frequent value

In [None]:
df.sex.mode()[0]

In [None]:
df.sex.fillna(df.sex.mode()[0]).value_counts()

In [None]:
df.sex.value_counts()

Numerical attribute: assign the mean value

In [None]:
df[df.age.isna()]

In [None]:
print(df.age.mean())
df['age'].fillna(df.age.mean())[1985]

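More imputation strategies are available in `scikit-learn`. They mainly target numeric attributes, although `SimpleImputer` also handles categorical data with the `most_frequent` and `constant` strategies. We will cover `scikit-learn` extensively in the next lectures; a simple example follows.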

In [None]:
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

In [None]:
X = np.asarray([[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7], [3, 6, 5], [1, 3, 5], [2, 7, 5]])
print(X)

**Simple imputer**: univariate imputer for completing missing values with simple strategies.

Replace missing values using a descriptive statistic (e.g. mean, median, or most frequent) along each column, or using a constant value.

In [None]:
imputer = SimpleImputer() # by default, it uses the "mean"
imputer.fit_transform(X)
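
The other strategies are selected with the `strategy` parameter. A short sketch, reusing the same toy matrix and adding a made-up `sex` column to show that `most_frequent` is not limited to numbers:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.asarray([[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7],
                [3, 6, 5], [1, 3, 5], [2, 7, 5]])

# Median instead of mean: less sensitive to extreme values
X_median = SimpleImputer(strategy='median').fit_transform(X)
print(X_median[0, 2], X_median[2, 0])  # 5.0 2.5

# Most frequent value: also works on string data (made-up 'sex' column)
sex = np.array([['F'], ['M'], [np.nan], ['F']], dtype=object)
sex_filled = SimpleImputer(strategy='most_frequent').fit_transform(sex)
print(sex_filled.ravel().tolist())  # ['F', 'M', 'F', 'F']
```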

**KNN-imputer**: Imputation for completing missing values using k-Nearest Neighbors.

Each sample’s missing values are imputed using the mean value from the *n_neighbors* nearest neighbors found in the training set. Two samples are considered close if they are close on the features that neither of them is missing.

In [None]:
imputer = KNNImputer(n_neighbors = 2)
imputer.fit_transform(X)
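
To see where an imputed value comes from, the computation for one cell can be reproduced by hand. The sketch below does this for the missing entry in row 0, column 2 of the same toy matrix. It assumes the default settings of `KNNImputer` (NaN-aware Euclidean distance, uniform weights) and is meant as an illustration of the idea, not as a description of the library internals.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

X = np.asarray([[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7],
                [3, 6, 5], [1, 3, 5], [2, 7, 5]])

# Distance from sample 0 to every sample, skipping coordinates missing in either one
dist = nan_euclidean_distances(X[[0]], X)[0]

# Candidate donors for column 2: samples in which that column is observed
donors = np.flatnonzero(~np.isnan(X[:, 2]))

# The two closest donors, and the mean of their values in column 2
nearest = donors[np.argsort(dist[donors])[:2]]
manual_value = X[nearest, 2].mean()

# Value produced by the imputer for the same cell
knn_value = KNNImputer(n_neighbors=2).fit_transform(X)[0, 2]
print(nearest, manual_value, knn_value)
```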

**Iterative imputer**: Multivariate imputer that estimates each feature as a function of all the others.


*This estimator is still experimental for now: the predictions and the API might change without any deprecation cycle.*
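
In `scikit-learn` this estimator is `IterativeImputer`. Because it is experimental, it has to be enabled with an explicit import before it can be used. A minimal sketch on the same toy matrix (the `max_iter` and `random_state` values are arbitrary choices for the example):

```python
import numpy as np
# The experimental flag must be imported before IterativeImputer itself
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.asarray([[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7],
                [3, 6, 5], [1, 3, 5], [2, 7, 5]])

# Each feature with missing values is regressed on the other features,
# and the predictions fill the gaps; observed values are left untouched
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```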

# Data Discretization


Load the IRIS dataset.

In [None]:
df = pd.read_csv(os.path.join('dataset', 'iris.csv'))
df

## Discretization with pandas

Sometimes we may need to transform a continuous variable to a categorical one. For example, we may want to convert ages to groups of age ranges.

`Pandas` provides two functions for binning values of a continuous variable into discrete intervals. Let *x* be the input array to be binned.
- `pd.cut(x, bins, ...)`: generic function for binning
    - bins:
        - *int*: number of bins (support for **equal-width** binning)
        - *sequence of scalars*: bins edges (support for **non-uniform** width binning) 
- `pd.qcut(x, q, ...)`: quantile-based discretization function.
    - q:
        - *int*: number of quantiles (support for **equal-frequency** binning)
        - *list of float*: array of quantiles (support for **custom-frequency** binning) 
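
The two non-integer forms are not used on the IRIS data in this notebook, so here is a minimal sketch of them on the age example mentioned above (the ages, bin edges and labels are made up for illustration):

```python
import pandas as pd

ages = pd.Series([3, 17, 25, 40, 64, 81])  # made-up ages

# Non-uniform width: explicit bin edges (intervals are right-closed by default)
groups = pd.cut(ages, bins=[0, 18, 65, 100], labels=['minor', 'adult', 'senior'])
print(groups.tolist())
# ['minor', 'minor', 'adult', 'adult', 'adult', 'senior']

# Custom frequency: explicit quantiles (here: below and above the median)
halves = pd.qcut(ages, q=[0, 0.5, 1.0])
print(halves.value_counts(sort=False))
```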

### Equal-width discretization

In [None]:
df['sepallength_cat'], bins = pd.cut(df.sepallength, 5, retbins = True)
df

Plot the novel categorical variable:

In [None]:
df.info()

In [None]:
sns.countplot(x = 'sepallength_cat', 
              hue = 'sepallength_cat', 
              data = df, 
              palette = 'pastel', 
              legend = False)
plt.show()

Plot the novel categorical variable, by class:

In [None]:
sns.countplot(x = 'sepallength_cat',
              hue = 'class',
              data = df,
              palette = 'pastel')
plt.show()

What do the bins look like?

In [None]:
print(bins)

In [None]:
sns.histplot(x = 'sepallength', data = df)
for edge in bins:
    plt.axvline(edge,
                color = 'k',
                linestyle = '--',
                linewidth = 3)

### Equal-frequency discretization

In [None]:
# suppose that we want to end up with 4 categories
nbins = 4 

df['petallength_cat'], bins = pd.cut(df.petallength, nbins, retbins = True)
df['petallength_qcat'], qbins = pd.qcut(df.petallength, nbins, retbins = True)
df

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.describe(include = 'category')

In [None]:
fig, axes = plt.subplots(2, 2, figsize = (10, 8), sharey = True)

# axes[0][0]: top-left
sns.countplot(x = 'petallength_cat', 
              hue = 'petallength_cat', 
              data = df, 
              palette = 'pastel', 
              ax = axes[0][0], 
              legend = False)
axes[0][0].set_title('equal width binning')

# axes[0][1]: top-right
sns.countplot(x = 'petallength_qcat', 
              hue = 'petallength_qcat', 
              data = df, 
              palette = 'pastel', 
              ax = axes[0][1], 
              legend = False)
axes[0][1].set_title('equal freq binning')

# axes[1][0]: bottom-left
sns.histplot(x = 'petallength',
             data = df,
             ax = axes[1][0])
for edge in bins:
    axes[1][0].axvline(edge, 
                       color = 'k',
                       linestyle = '--',
                       linewidth = 3)
    
# axes[1][1]: bottom-right
sns.histplot(x = 'petallength',
             data = df,
             ax = axes[1][1])
for edge in qbins:
    axes[1][1].axvline(edge,
                       color = 'k',
                       linestyle = '--',
                       linewidth = 3)

# Improve subplot size/spacing
fig.tight_layout() 

**Binarization** can be obtained simply with `cut()` or `qcut()` by requesting 2 bins.

In [None]:
df['sepalwidth_bin'], bins = pd.cut(df.sepalwidth, 2, retbins = True)
df.sepalwidth_bin.value_counts()

In [None]:
df['sepalwidth_qbin'], bins = pd.qcut(df.sepalwidth, 2, retbins = True)
df.sepalwidth_qbin.value_counts()

Alternatively, we can use a simple threshold function.

In [None]:
df['sepalwidth_bin_custom'] = (df["sepalwidth"] <= 3.0).astype(int)
df.sepalwidth_bin_custom.value_counts()

In [None]:
df['sepalwidth_bin_custom'] =  (df["sepalwidth"] <= 2.9).astype(int)
df.sepalwidth_bin_custom.value_counts()

`qcut()` does its best to get equifrequent bins, but the result depends on the data (tied values always end up in the same bin) and on numerical precision.
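
The effect of ties is easy to reproduce on a small made-up series:

```python
import pandas as pd

values = pd.Series([1, 2, 2, 2, 2, 3, 4, 5])  # made-up data with many ties at 2

# Ask for two equal-frequency bins
halves = pd.qcut(values, 2)
counts = halves.value_counts(sort=False)
print(counts)
```

Here the median is 2, and all the values equal to 2 fall in the first bin, so the two bins contain 5 and 3 values instead of 4 and 4.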

### From categorical to numerical
Note that ordinal and interval-scaled variables should be treated as such in downstream processing (e.g., similarity/distance evaluation).

In [None]:
df

Consider, for example, the breast_cancer dataset:

| column | values |
| --- | --- |
| Class | no-recurrence-events, recurrence-events |
| age | 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99|
| menopause | lt40, ge40, premeno|
| tumor-size | 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59|
| inv-nodes | 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26, 27-29, 30-32, 33-35, 36-39|
| node-caps | yes, no|
| deg-malig | 1, 2, 3|
| breast | left, right|
| breast-quad | left-up, left-low, right-up, right-low, central|
| irradiat | yes, no|
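
Several of these attributes (e.g., `age`, `tumor-size`, `inv-nodes`) are ordinal even though they are stored as strings. One possible way to keep the ordering, sketched here under the assumption that the levels are listed in increasing order, is an ordered pandas categorical whose integer codes follow that order (the four sample values are made up):

```python
import pandas as pd

# Levels of the 'age' attribute, listed from lowest to highest
age_levels = ['10-19', '20-29', '30-39', '40-49', '50-59',
              '60-69', '70-79', '80-89', '90-99']

age = pd.Series(['40-49', '30-39', '60-69', '40-49'])  # made-up values

# Ordered categorical: comparisons and sorting respect the level order
age_ord = pd.Categorical(age, categories=age_levels, ordered=True)

# Integer codes are the positions of the levels in age_levels
codes = pd.Series(age_ord.codes)
print(codes.tolist())  # [3, 2, 5, 3]
```

The codes preserve the order but not the width of the intervals, so they are a rank-like encoding rather than a measurement.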
 




Suppose that an iris_cat dataset is represented as follows.

In [None]:
categorical_variables = ['sepallength_cat', 'petallength_cat']

In [None]:
iris_cat_df = df[categorical_variables].copy() # work on a copy, so that df itself is left unchanged
iris_cat_df

In [None]:
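# Replace each interval with its midpoint; astype(float) makes the result numeric rather than categorical
iris_cat_df['sepallength_mean'] = iris_cat_df.sepallength_cat.apply(lambda x: x.mid).astype(float)
iris_cat_df['petallength_mean'] = iris_cat_df.petallength_cat.apply(lambda x: x.mid).astype(float)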


In [None]:
iris_cat_df

In [None]:
iris_num = iris_cat_df.iloc[:, -2:]
iris_num

The *mid* attribute belongs to pandas `Interval` objects, so it is convenient when the categories are intervals, as produced by `cut()` and `qcut()`.

In general, we can define custom mappings: e.g., how could we handle the age variable of breast_cancer, in which intervals are represented as strings?

```python
dict_categories = {"10-19":15, "20-29":25, "30-39":35, "40-49":45, "50-59":55, "60-69":65, "70-79":75, "80-89":85, "90-99":95}
```
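
Such a dictionary can then be applied with `Series.map`. A minimal sketch, in which the four `age` values are made up and stand in for the real breast_cancer column:

```python
import pandas as pd

# Representative numeric value for each age range
dict_categories = {"10-19": 15, "20-29": 25, "30-39": 35, "40-49": 45, "50-59": 55,
                   "60-69": 65, "70-79": 75, "80-89": 85, "90-99": 95}

age = pd.Series(["40-49", "30-39", "60-69", "40-49"])  # made-up values

# Each string label is replaced by the number it maps to
age_num = age.map(dict_categories)
print(age_num.tolist())  # [45, 35, 65, 45]
```

Labels that do not appear in the dictionary are mapped to NaN, which makes unexpected categories easy to spot.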