Throughout the first part of the course ("traditional machine learning") we will mainly use the following Python packages or modules:

- **numpy**: for storing and manipulating multidimensional arrays 
- **pandas**: library built on top of NumPy providing higher-level data manipulation tools
- **scikit-learn**: main machine learning library for traditional algorithms - `pip install scikit-learn`
- **matplotlib**: the most used library for data visualization

We import them and check their version numbers by printing the <code>\_\_version\_\_</code> attribute. This is just a check that we all use the same versions of the libraries.

By importing modules, we can use methods, classes and functions defined in the libraries.

In [9]:
import sklearn
import numpy as np
import pandas as pd
import matplotlib as mpl
import scipy as sp
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OrdinalEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [10]:
print(np.__version__, sp.__version__, sklearn.__version__, mpl.__version__, pd.__version__)

1.26.4 1.12.0 1.4.2 3.8.3 2.2.1


The most important library for machine learning is **scikit-learn**, a de-facto standard for developing ML-oriented projects. 
Here we introduce some characteristics of the library that are helpful for the data preprocessing phase of an ML project. We also introduce two other important modules: **pandas** and **numpy**. The former plays an important role in data preprocessing, too.
![](slides/Slide3.png)
![](slides/Slide4.png)

# Class Hierarchy in Scikit-Learn

![](slides/Slide5.png)
![](slides/Slide6.png)

# Data Preprocessing Pipeline

![](slides/Slide8.png)


Before facing the different steps in data preprocessing, we get the data from a data source, a CSV file, organized as a table.

### Loading data

We use the module pandas - imported as **pd** - to read the CSV file into a table, namely a **DataFrame** object. A DataFrame models tabular data: it is made of rows, and each row is described by a set of named columns.

**Task**: Read a CSV file and transform the data into a DataFrame. To accomplish the task we use the pandas function [**read_csv**](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). The file is in the folder *data* and it's called *playground.csv*.

In [11]:
playground = pd.read_csv('data/playground.csv')
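To experiment with <code>read_csv</code> without the course file, a minimal sketch can read CSV text from memory. The content of `csv_text` is a hypothetical stand-in, not the real *playground.csv*; note how an empty field becomes a missing value.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for data/playground.csv
csv_text = "age,size,country\n47,L,Denmark\n20,XXL,Denmark\n72,,Austria\n"

# read_csv accepts a path or any file-like object
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)    # (number of rows, number of columns)
print(toy.dtypes)   # one dtype per column
```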

### Preliminary inspection

In DeepNote, we can visualize the DataFrame by writing its name.

In [12]:
playground

Unnamed: 0,age,length_screw,diameter,middle_diameter,hammer_strength,rank,size,country
0,47.0,39.537837,17.053504,4.301665,46.389102,medium,L,Denmark
1,20.0,-1.034592,-67.283321,,-167.451060,good,XXL,Denmark
2,72.0,40.686959,-21.411874,2.578786,-34.345003,medium,L,Austria
3,23.0,60.179331,39.712469,3.065198,118.023664,perfect,XL,Austria
4,93.0,2.862281,33.059898,,99.476390,perfect,XL,Germany
...,...,...,...,...,...,...,...,...
9995,80.0,29.688847,-8.900835,5.830887,-15.037037,good,XXL,Denmark
9996,35.0,23.757652,54.140757,3.128091,124.841787,medium,L,Germany
9997,31.0,7.493367,-32.458441,2.022378,-83.878678,perfect,L,France
9998,33.0,33.341941,-36.872158,,-84.604433,good,M,Austria


As we can see, in DeepNote the header of the table reports a detailed overview of the columns, including missing values and data distribution. Similar information is returned by the methods **info()** and **describe()**.

In [13]:
playground.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   age              9990 non-null   float64
 1   length_screw     9970 non-null   float64
 2   diameter         9990 non-null   float64
 3   middle_diameter  4994 non-null   float64
 4   hammer_strength  9970 non-null   float64
 5   rank             9990 non-null   object 
 6   size             9950 non-null   object 
 7   country          9990 non-null   object 
dtypes: float64(5), object(3)
memory usage: 625.1+ KB


In [14]:
playground.describe(include='all')

Unnamed: 0,age,length_screw,diameter,middle_diameter,hammer_strength,rank,size,country
count,9990.0,9970.0,9990.0,4994.0,9970.0,9990,9950,9990
unique,,,,,,4,5,7
top,,,,,,perfect,L,Canada
freq,,,,,,2553,2034,1459
mean,57.82042,29.885444,1.462067,3.006871,14.701874,,,
std,23.368789,20.103509,40.196593,1.996277,101.184493,,,
min,18.0,-45.443865,-154.473963,-4.0792,-376.36717,,,
25%,37.0,16.177003,-25.098778,1.684077,-52.087874,,,
50%,58.0,29.801311,1.209477,2.995461,14.201412,,,
75%,78.0,43.547181,28.115844,4.344913,81.566187,,,


While for numerical columns we get a fairly complete overview of the features, for categorical columns we get less information. To obtain the distribution of the unique values in a specific column, we use the **value_counts()** method. It returns the frequency of each category in a column.

In [15]:
playground['size'].value_counts()

size
L      2034
XXL    2011
M      1989
S      1976
XL     1940
Name: count, dtype: int64
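As a small sketch on made-up data (the Series `sizes` is hypothetical), <code>value_counts</code> can also return relative frequencies or include the missing values, which it ignores by default:

```python
import numpy as np
import pandas as pd

sizes = pd.Series(['L', 'M', 'L', np.nan, 'S', 'L'], name='size')

counts = sizes.value_counts()                  # missing values are ignored by default
shares = sizes.value_counts(normalize=True)    # relative frequencies instead of counts
with_nan = sizes.value_counts(dropna=False)    # also count the missing values

print(counts)
print(with_nan)
```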

Here we select a specific column or a subset of columns using the indexing syntax
```python
df['column name']
```
or
```python
df[['col_name_1',...,'col_name_k']]
```
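The two forms return different types, which matters later because scikit-learn transformers expect 2-dimensional input. A minimal sketch on a toy DataFrame (the data is made up):

```python
import pandas as pd

df = pd.DataFrame({'age': [47, 20], 'size': ['L', 'XXL']})

col_series = df['size']     # single brackets: a pandas.Series, 1-dimensional
col_frame = df[['size']]    # double brackets: a one-column DataFrame, 2-dimensional

print(type(col_series).__name__, col_series.shape)
print(type(col_frame).__name__, col_frame.shape)
```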

### Missing data: removal and imputation
![](slides/Slide9.png)

**Task**: Get the boolean DataFrame indicating where the missing values are, and visualize the first ten rows using the method [**head**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html). 

In [16]:
playground.isnull().head(10)

Unnamed: 0,age,length_screw,diameter,middle_diameter,hammer_strength,rank,size,country
0,False,False,False,False,False,False,False,False
1,False,False,False,True,False,False,False,False
2,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False
4,False,False,False,True,False,False,False,False
5,False,False,False,False,False,False,False,False
6,False,False,False,True,False,False,False,False
7,False,False,False,True,False,False,False,False
8,False,False,False,False,False,False,False,False
9,False,False,False,False,False,False,False,False


If you have a boolean DataFrame, you can count the number of missing values per column with the method [**sum**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sum.html).

In Python, <code>True</code> counts as 1 and <code>False</code> as 0 in arithmetic, so summing booleans counts the <code>True</code> values.

**Task**: Count the missing values per column.

In [17]:
playground.isnull().sum(axis=0)

age                  10
length_screw         30
diameter             10
middle_diameter    5006
hammer_strength      30
rank                 10
size                 50
country              10
dtype: int64

**Task (Intermediate)**: Inspect the documentation of <code>sum</code>. How can you get the number of rows containing at least one missing value?

*Hint:* If we evaluate a comparison operator like <code>df > value</code>, we get a boolean DataFrame. The same holds for **Series** - pandas.Series, aka a single column or row.

In [18]:
(playground.isnull().sum(axis=1) > 0).sum()

5039
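An equivalent way to express "at least one missing value in the row" is the method <code>any</code>. The sketch below uses a made-up DataFrame and is only one possible formulation:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': ['x', 'y', None, None],
})

mask = df.isnull()
per_column = mask.sum(axis=0)           # missing values in each column
rows_with_missing = mask.any(axis=1)    # True where a row has at least one missing value
n_rows = rows_with_missing.sum()

print(per_column)
print(n_rows)
```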

![](slides/Slide10.png)

Documentation for [**dropna**](https://pandas.pydata.org/pandas-docs/version/1.2/reference/api/pandas.DataFrame.dropna.html).

**Task:** Remove all rows containing at least one missing value. Is it a good idea? Why?

In [19]:
playground.dropna(axis=0)

Unnamed: 0,age,length_screw,diameter,middle_diameter,hammer_strength,rank,size,country
0,47.0,39.537837,17.053504,4.301665,46.389102,medium,L,Denmark
2,72.0,40.686959,-21.411874,2.578786,-34.345003,medium,L,Austria
3,23.0,60.179331,39.712469,3.065198,118.023664,perfect,XL,Austria
5,65.0,52.287209,12.532527,4.092027,44.922023,bad,M,Austria
8,65.0,51.078399,-22.971850,4.033592,-33.540389,good,S,Denmark
...,...,...,...,...,...,...,...,...
9994,76.0,37.231863,-23.314827,2.293006,-38.269593,perfect,S,Italy
9995,80.0,29.688847,-8.900835,5.830887,-15.037037,good,XXL,Denmark
9996,35.0,23.757652,54.140757,3.128091,124.841787,medium,L,Germany
9997,31.0,7.493367,-32.458441,2.022378,-83.878678,perfect,L,France


**A:** It's not a good solution: we removed 5039 rows, i.e. about half of the dataset.

Don't worry if you removed many rows: the original DataFrame is still untouched, because most methods that modify a DataFrame return a new DataFrame unless we set the parameter <code>inplace=True</code>.
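A quick check on a toy DataFrame (made-up data) illustrates that <code>dropna</code> leaves the original object unchanged:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, 6.0]})

cleaned = df.dropna(axis=0)    # a new DataFrame without the incomplete row

print(len(df), len(cleaned))   # the original still has all its rows
```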

**Task**: Remove all the features containing at least one missing value. Is it a good idea? Can we do better?

In [20]:
playground.dropna(axis=1)

0
1
2
3
4
...
9995
9996
9997
9998
9999


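**A:** We removed all the columns, because each of them contains at least one missing value. We got an empty dataset (only the index is left). No good.

The main problem is the column *middle_diameter*, which contains a lot of missing values (5006 out of 10000). We remove it by exploiting the parameter **thresh** of the method **dropna**: with <code>axis=1</code>, only the columns with at least <code>thresh</code> non-missing values are kept.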

In [21]:
playground = pd.read_csv('data/playground.csv')

In [22]:
playground = playground.dropna(thresh=6000,axis=1).dropna(axis=0,how='all')

We removed the column *middle_diameter* and, on the resulting DataFrame, we removed all the rows made up entirely of missing data. 
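The same two-step removal can be tried on a small made-up DataFrame; the column names and the threshold of 3 are illustrative only:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'mostly_missing': [np.nan, np.nan, np.nan, 1.0],
    'almost_full':    [1.0, 2.0, np.nan, 4.0],
    'full':           [1.0, 2.0, 3.0, 4.0],
})

# keep only the columns with at least 3 non-missing values
kept = df.dropna(thresh=3, axis=1)

# then drop the rows where every remaining value is missing (none in this toy case)
kept = kept.dropna(axis=0, how='all')

print(kept.columns.tolist())
```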

In [23]:
playground.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9990 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   age              9990 non-null   float64
 1   length_screw     9970 non-null   float64
 2   diameter         9990 non-null   float64
 3   hammer_strength  9970 non-null   float64
 4   rank             9990 non-null   object 
 5   size             9950 non-null   object 
 6   country          9990 non-null   object 
dtypes: float64(4), object(3)
memory usage: 624.4+ KB


Still, a few columns contain missing values: *length_screw*, *hammer_strength* and *size*. But the overall number of missing values is very limited. In these cases it is more suitable to **impute** the missing values.

![](slides/Slide11.png)

**Task**: Apply an imputation with strategy *most_frequent* to the feature *size*. Use the <code>SimpleImputer</code> class. Why the feature *size* and not *rank*?


In [24]:
from sklearn.impute import SimpleImputer

In [25]:
playground['size'].value_counts()

size
L      2034
XXL    2011
M      1989
S      1976
XL     1940
Name: count, dtype: int64

In [26]:
si = SimpleImputer(strategy='most_frequent')
playground['size'] = si.fit_transform(playground[['size']])

ValueError: 2
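The cell above ends with a <code>ValueError</code>, so *size* keeps its missing values (it still shows 9950 non-null entries in the later <code>info()</code> output). The traceback is not shown, so the cause is an assumption: <code>fit_transform</code> returns a 2-dimensional array of shape `(n_samples, 1)`, and assigning such an array of strings to a single column label may fail in this pandas version. A possible workaround, sketched on made-up data, is to flatten the array before assigning it:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({'size': ['L', 'M', np.nan, 'L', 'S']})

si = SimpleImputer(strategy='most_frequent')
imputed = si.fit_transform(toy[['size']])   # 2-D array, shape (5, 1)

toy['size'] = imputed.ravel()               # flatten to 1-D before assigning

print(toy['size'].tolist())
```

Assigning to the double-bracket selection, `toy[['size']] = imputed`, is another option that keeps both sides 2-dimensional.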

**Task:** Apply imputation with strategy *mean* to the features *hammer_strength* and *length_screw*. Use the <code>SimpleImputer</code> class.

In [27]:
si_mean = SimpleImputer(strategy='mean')
playground[['length_screw','hammer_strength']] = si_mean.fit_transform(playground[['length_screw','hammer_strength']])
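After fitting, the values used for the imputation can be inspected through the attribute <code>statistics_</code>. A small sketch with made-up columns:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({'length': [1.0, np.nan, 3.0], 'strength': [10.0, 20.0, np.nan]})

si_mean = SimpleImputer(strategy='mean')
toy[['length', 'strength']] = si_mean.fit_transform(toy[['length', 'strength']])

print(si_mean.statistics_)   # per-column means learned during fit
print(toy)
```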

In [28]:
playground

Unnamed: 0,age,length_screw,diameter,hammer_strength,rank,size,country
0,47.0,39.537837,17.053504,46.389102,medium,L,Denmark
1,20.0,-1.034592,-67.283321,-167.451060,good,XXL,Denmark
2,72.0,40.686959,-21.411874,-34.345003,medium,L,Austria
3,23.0,60.179331,39.712469,118.023664,perfect,XL,Austria
4,93.0,2.862281,33.059898,99.476390,perfect,XL,Germany
...,...,...,...,...,...,...,...
9995,80.0,29.688847,-8.900835,-15.037037,good,XXL,Denmark
9996,35.0,23.757652,54.140757,124.841787,medium,L,Germany
9997,31.0,7.493367,-32.458441,-83.878678,perfect,L,France
9998,33.0,33.341941,-36.872158,-84.604433,good,M,Austria


### Transform Categorical Data

#### The pandas Way

![](slides/Slide12.png)
![](slides/Slide13.png)
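The slides are not reproduced in this text, so the following is only a plausible sketch of a pandas-only encoding, assuming <code>map</code> for the ordinal feature and <code>get_dummies</code> for the nominal one. The toy data and the name `size_order` are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'size': ['S', 'XL', 'M'],
    'country': ['Italy', 'Spain', 'Italy'],
})

# ordinal feature: map each category to its position in the intended order
size_order = {'S': 0, 'M': 1, 'L': 2, 'XL': 3, 'XXL': 4}
toy['size_encoded'] = toy['size'].map(size_order)

# nominal feature: one indicator column per category
dummies = pd.get_dummies(toy['country'], prefix='country')
toy = pd.concat([toy, dummies], axis=1)

print(toy)
```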

#### The SKLearn Way

![](slides/Slide14.png)

Documentation for: [**OrdinalEncoder**](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#sklearn.preprocessing.OrdinalEncoder).
Documentation for: [**OneHotEncoder**](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder).

One of the parameters of the <code>OrdinalEncoder</code> constructor is *categories*. It specifies the order of the categories of each feature: we pass a list of lists (or arrays), where the $i$-th inner list contains the ordered categories of the $i$-th feature.

Example:
```python
oe = OrdinalEncoder(categories=[['A','B','C']])
```
0 - the index of the element 'A' - will replace A, 1 - the index of the element 'B' - will replace B, and so on.
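A runnable version of this example, on a made-up column called `grade`, could look as follows:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({'grade': ['B', 'A', 'C', 'A']})

oe = OrdinalEncoder(categories=[['A', 'B', 'C']])
encoded = oe.fit_transform(toy[['grade']])   # 2-D input, hence the double brackets

print(encoded.ravel())   # each category replaced by its index in the list
```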

In [29]:
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

**Task**: Apply the previous encodings but using SKLearn classes.

In [34]:
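# Encoders expect 2-D input: select with double brackets (a one-column DataFrame).
# Passing a Series such as playground['size'] raises the ValueError shown below.
oe_size = OrdinalEncoder(categories=[['S','M','L','XL','XXL']])
oe_size.fit_transform(playground[['size']])
oe_rank = OrdinalEncoder(categories=[['bad','medium','good','perfect']])
oe_rank.fit_transform(playground[['rank']])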

ValueError: Expected a 2-dimensional container but got <class 'pandas.core.series.Series'> instead. Pass a DataFrame containing a single row (i.e. single sample) or a single column (i.e. single feature) instead.

In [35]:
playground['rank'][:4]

0     medium
1       good
2     medium
3    perfect
Name: rank, dtype: object

In [36]:
ohe_country = OneHotEncoder()
temp = ohe_country.fit_transform(playground[['country']]).toarray()
names = ohe_country.get_feature_names_out()
playground[list(names)] = temp 

In [37]:
playground.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9990 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   age              9990 non-null   float64
 1   length_screw     9990 non-null   float64
 2   diameter         9990 non-null   float64
 3   hammer_strength  9990 non-null   float64
 4   rank             9990 non-null   object 
 5   size             9950 non-null   object 
 6   country          9990 non-null   object 
 7   country_Austria  9990 non-null   float64
 8   country_Canada   9990 non-null   float64
 9   country_Denmark  9990 non-null   float64
 10  country_France   9990 non-null   float64
 11  country_Germany  9990 non-null   float64
 12  country_Italy    9990 non-null   float64
 13  country_Spain    9990 non-null   float64
dtypes: float64(11), object(3)
memory usage: 1.1+ MB



![](slides/Slide15.png)
![](slides/Slide16.png)


**Task**: Apply a min-max scaling to the feature *age*.

In [39]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

In [40]:
mm_scaler = MinMaxScaler()
mm_scaler.fit_transform(playground[['age']])

array([[0.3625],
       [0.025 ],
       [0.675 ],
       ...,
       [0.1625],
       [0.1875],
       [0.725 ]])
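Min-max scaling maps each value $x$ to $(x - \min) / (\max - \min)$. The sketch below checks this against <code>MinMaxScaler</code> on a few made-up ages:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

ages = np.array([[18.0], [47.0], [98.0]])

scaled = MinMaxScaler().fit_transform(ages)

# same result by hand: (x - min) / (max - min)
manual = (ages - ages.min()) / (ages.max() - ages.min())

print(scaled.ravel())
```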

**Task**: Apply a standardization to the features *length_screw*, *diameter* and *hammer_strength*.

In [41]:
sscaler = StandardScaler()
sscaler.fit_transform(playground[['length_screw','diameter','hammer_strength']])

array([[ 0.48064018,  0.38789898,  0.31349256],
       [-1.53966089, -1.71031482, -1.80210112],
       [ 0.5378606 , -0.56908024, -0.48523749],
       ...,
       [-1.11501183, -0.8439075 , -0.97529103],
       [ 0.17211597, -0.95371625, -0.98247117],
       [-0.07080558,  0.51831827,  0.40172249]])
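Standardization subtracts the column mean and divides by the column standard deviation. A small check on made-up values shows that <code>StandardScaler</code> uses the population standard deviation (`ddof=0`), which is also the NumPy default:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [6.0]])

scaler = StandardScaler()
Z = scaler.fit_transform(X)

# same result by hand: (x - mean) / std, with the population standard deviation
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(scaler.mean_, Z.ravel())
```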

A complete overview of the most important characteristics of the features and of the required transformations is depicted here:
![](figures/ColumnTransformerOverview.jpg)

Here, for each column we indicate the chain of transformations. Each chain will be merged into a big transformation corresponding to the preprocessing phase.

### Pipeline
![](slides/Slide17.png)

In our case, we have to implement two pipelines composed of two transformers.
![](slides/Slide18.png)

In [None]:
from sklearn.pipeline import Pipeline

In [None]:
playground = pd.read_csv('data/playground.csv')
playground = playground.dropna(thresh=6000,axis=1).dropna(axis=0,how='all')

**Task**: Define the first pipeline, composed of:
1. SimpleImputer with strategy = 'mean'
2. StandardScaler

The choice of names is up to you.

In [None]:
imp_scaler = Pipeline(
    [
        ('pipe1_si',SimpleImputer(strategy='mean')),
        ('pipe1_scaler',StandardScaler())
    ]
)
imp_scaler.fit_transform(playground[['length_screw','hammer_strength']])

array([[ 0.48064018,  0.31349256],
       [-1.53966089, -1.80210112],
       [ 0.5378606 , -0.48523749],
       ...,
       [-1.11501183, -0.97529103],
       [ 0.17211597, -0.98247117],
       [-0.07080558,  0.40172249]])

In [None]:
imp_scaler.get_feature_names_out()

array(['length_screw', 'hammer_strength'], dtype=object)

In [None]:
imp_scaler.get_params()

{'memory': None,
 'steps': [('pipe1_si', SimpleImputer()), ('pipe1_scaler', StandardScaler())],
 'verbose': False,
 'pipe1_si': SimpleImputer(),
 'pipe1_scaler': StandardScaler(),
 'pipe1_si__add_indicator': False,
 'pipe1_si__copy': True,
 'pipe1_si__fill_value': None,
 'pipe1_si__missing_values': nan,
 'pipe1_si__strategy': 'mean',
 'pipe1_si__verbose': 'deprecated',
 'pipe1_scaler__copy': True,
 'pipe1_scaler__with_mean': True,
 'pipe1_scaler__with_std': True}
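The keys of the form `<step name>__<parameter>` returned by <code>get_params</code> can also be used with <code>set_params</code> to change a single step. A minimal sketch, where switching to the median is just an illustrative choice and the data is made up:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ('pipe1_si', SimpleImputer(strategy='mean')),
    ('pipe1_scaler', StandardScaler()),
])

# <step name>__<parameter name> addresses a parameter of a single step
pipe.set_params(pipe1_si__strategy='median')

X = np.array([[1.0], [2.0], [np.nan], [9.0]])
pipe.fit(X)

print(pipe.named_steps['pipe1_si'].statistics_)   # median of 1, 2 and 9
```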

**Task:** Define the second and third pipelines, composed of:
*Pipeline2*:
1. SimpleImputer with strategy = 'most_frequent'
2. OrdinalEncoder with categories = [['S','M','L','XL','XXL']]
3. MinMaxScaler

*Pipeline3*
1. OrdinalEncoder with categories = [['bad','medium','good','perfect']]
2. MinMaxScaler

In [None]:
imp_ordinal = Pipeline(
    [
        ('pipe2_si', SimpleImputer(strategy='most_frequent')),
        ('pipe2_ordinal', OrdinalEncoder(categories=[['S','M','L','XL','XXL']])),
        ('pipe2_mm',MinMaxScaler())
    ]
)

imp_ordinal2 = Pipeline(
    [
        ('pipe3_ordinal', OrdinalEncoder(categories=[['bad','medium','good','perfect']])),
        ('pipe3_mm',MinMaxScaler())
    ]
)
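To see what such a pipeline produces, the sketch below applies the three steps of *Pipeline2* to a tiny made-up column. The missing value is first replaced by the most frequent category, then encoded and scaled; the names `toy` and `size_pipe` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

toy = pd.DataFrame({'size': ['S', 'XXL', np.nan, 'L', 'L']})

size_pipe = Pipeline([
    ('si', SimpleImputer(strategy='most_frequent')),
    ('ordinal', OrdinalEncoder(categories=[['S', 'M', 'L', 'XL', 'XXL']])),
    ('mm', MinMaxScaler()),
])

out = size_pipe.fit_transform(toy[['size']])
print(out.ravel())
```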


Once we have defined the pipelines, we have all the transformers - the ingredients - to build a <code>ColumnTransformer</code> to apply to our dataset.

### Column Transformer
![](slides/Slide19.png)
![](slides/Slide20.png)

**Task**: Implement the ColumnTransformer depicted here:
![](figures/ColumnTransformerOverview.jpg)

Start from the playground DataFrame loaded in the next code cell and put all the blocks together 😊😊

In [None]:
playground = pd.read_csv('data/playground.csv').dropna(thresh=6000,axis=1).dropna(axis=0,how='all')
playground

Unnamed: 0,age,length_screw,diameter,hammer_strength,rank,size,country
0,47.0,39.537837,17.053504,46.389102,medium,L,Denmark
1,20.0,-1.034592,-67.283321,-167.451060,good,XXL,Denmark
2,72.0,40.686959,-21.411874,-34.345003,medium,L,Austria
3,23.0,60.179331,39.712469,118.023664,perfect,XL,Austria
4,93.0,2.862281,33.059898,99.476390,perfect,XL,Germany
...,...,...,...,...,...,...,...
9995,80.0,29.688847,-8.900835,-15.037037,good,XXL,Denmark
9996,35.0,23.757652,54.140757,124.841787,medium,L,Germany
9997,31.0,7.493367,-32.458441,-83.878678,perfect,L,France
9998,33.0,33.341941,-36.872158,-84.604433,good,M,Austria


In [None]:
from sklearn.compose import ColumnTransformer

In [None]:
playground_tf = ColumnTransformer(
    transformers=[
        ('age', MinMaxScaler(), ['age']),
        ('imp_std', imp_scaler, ['length_screw','hammer_strength']),
        ('diam_std', StandardScaler(), ['diameter']),
        ('rank', imp_ordinal2, ['rank']),
        ('imp_ord', imp_ordinal, ['size']),
        ('country_hot', OneHotEncoder(drop='first',categories='auto'),['country'])
    ],
    verbose_feature_names_out = False
)

In [None]:
playground_tf.fit(playground)
new_playground = pd.DataFrame(playground_tf.transform(playground), columns=playground_tf.get_feature_names_out())
new_playground

Unnamed: 0,age,length_screw,hammer_strength,diameter,rank,size,country_Canada,country_Denmark,country_France,country_Germany,country_Italy,country_Spain
0,0.3625,0.480640,0.313493,0.387899,0.333333,0.50,0.0,1.0,0.0,0.0,0.0,0.0
1,0.0250,-1.539661,-1.802101,-1.710315,0.666667,1.00,0.0,1.0,0.0,0.0,0.0,0.0
2,0.6750,0.537861,-0.485237,-0.569080,0.333333,0.50,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0625,1.508482,1.022198,0.951631,1.000000,0.75,0.0,0.0,0.0,0.0,0.0,0.0
4,0.9375,-1.345616,0.838703,0.786122,1.000000,0.75,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
9985,0.7750,-0.009790,-0.294217,-0.257818,0.666667,1.00,0.0,1.0,0.0,0.0,0.0,0.0
9986,0.2125,-0.305133,1.089652,1.310592,0.333333,0.50,0.0,0.0,0.0,1.0,0.0,0.0
9987,0.1625,-1.115012,-0.975291,-0.843907,1.000000,0.50,0.0,0.0,1.0,0.0,0.0,0.0
9988,0.1875,0.172116,-0.982471,-0.953716,0.666667,0.25,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
playground_tf_verbose = ColumnTransformer(
    transformers=[
        ('age', MinMaxScaler(), ['age']),
        ('imp_std', imp_scaler, ['length_screw','hammer_strength']),
        ('diam_std', StandardScaler(), ['diameter']),
        ('rank', imp_ordinal2, ['rank']),
        ('imp_ord', imp_ordinal, ['size']),
        ('country_hot', OneHotEncoder(drop='first',categories='auto'),['country'])
    ],
    verbose_feature_names_out = True
)
playground_tf_verbose.fit(playground)
pd.DataFrame(playground_tf_verbose.transform(playground), columns=playground_tf_verbose.get_feature_names_out())

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,0.3625,0.480640,0.313493,0.387899,0.333333,0.50,0.0,1.0,0.0,0.0,0.0,0.0
1,0.0250,-1.539661,-1.802101,-1.710315,0.666667,1.00,0.0,1.0,0.0,0.0,0.0,0.0
2,0.6750,0.537861,-0.485237,-0.569080,0.333333,0.50,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0625,1.508482,1.022198,0.951631,1.000000,0.75,0.0,0.0,0.0,0.0,0.0,0.0
4,0.9375,-1.345616,0.838703,0.786122,1.000000,0.75,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
9985,0.7750,-0.009790,-0.294217,-0.257818,0.666667,1.00,0.0,1.0,0.0,0.0,0.0,0.0
9986,0.2125,-0.305133,1.089652,1.310592,0.333333,0.50,0.0,0.0,0.0,1.0,0.0,0.0
9987,0.1625,-1.115012,-0.975291,-0.843907,1.000000,0.50,0.0,0.0,1.0,0.0,0.0,0.0
9988,0.1875,0.172116,-0.982471,-0.953716,0.666667,0.25,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
new_playground.shape

(9990, 12)

### Creating custom transformers
While SKLearn has many Transformers, it's often helpful to create our own. Here we will create a custom Transformer from scratch. In fact, in SKLearn, an object is a Transformer if it implements the methods:
1. fit(X, y)
2. transform(X), which should return a pandas DataFrame or a NumPy array

So, we have to define a class that inherits from the <code>BaseEstimator</code> and <code>TransformerMixin</code> classes found in the <code>sklearn.base</code> module, and implements the methods _fit_ and _transform_.
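A minimal sketch of such a class is shown below. The transformer `ClipTransformer` and its parameters are hypothetical, chosen only to illustrate the pattern: it learns two quantiles per column in _fit_ and clips the values to that range in _transform_.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ClipTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: clip each column to quantiles learned during fit."""

    def __init__(self, lower=0.05, upper=0.95):
        # constructor arguments are stored unchanged, so get_params/set_params work
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # learned state gets a trailing underscore, following the scikit-learn convention
        self.low_ = np.nanquantile(X, self.lower, axis=0)
        self.high_ = np.nanquantile(X, self.upper, axis=0)
        return self   # fit returns self so that calls can be chained

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return np.clip(X, self.low_, self.high_)

X = np.array([[1.0], [2.0], [3.0], [100.0]])
clipped = ClipTransformer(lower=0.0, upper=0.75).fit_transform(X)
print(clipped.ravel())
```

<code>TransformerMixin</code> provides <code>fit_transform</code> for free, and <code>BaseEstimator</code> provides <code>get_params</code> and <code>set_params</code>, so the class can be used inside a <code>Pipeline</code> or a <code>ColumnTransformer</code>.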

There is a second method to build custom transformers. The <code>FunctionTransformer</code> class builds a Transformer wrapper taking a function object, which implements the transformation.

First, we define the transformation, i.e. we define a function. Then, we pass the function to a <code>FunctionTransformer</code>.
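For instance, a logarithmic transformation can be wrapped as follows. The choice of `np.log1p` and its inverse `np.expm1` is only an illustration on made-up values:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# hypothetical transformation: log(1 + x), together with its inverse
log_tf = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)

X = np.array([[0.0], [9.0], [99.0]])
X_log = log_tf.fit_transform(X)
X_back = log_tf.inverse_transform(X_log)

print(X_log.ravel())
```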

## Time to fly on your own

We introduced all the most important elements to build a semi-automatic preprocessing pipeline. In this part of the notebook **you** have to deal with a new dataset.

The dataset is available in the file data/heart.csv

Its features are:
- age : Age of the patient
- sex : Sex of the patient
- exng: exercise induced angina (1 = yes; 0 = no)
- caa: number of major vessels (0-3)
- cp : chest pain type:
    - Value 1: typical angina
    - Value 2: atypical angina
    - Value 3: non-anginal pain
    - Value 4: asymptomatic
- trtbps : resting blood pressure (in mm Hg)
- chol : cholesterol in mg/dl fetched via BMI sensor
- fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
- restecg : resting electrocardiographic results:
    - Value 0: normal
    - Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    - Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
- thalachh : maximum heart rate achieved
- zip: zip code of the patient

In [1]:
import sklearn
import numpy as np
import pandas as pd
import matplotlib as mpl
import scipy as sp
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OrdinalEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [14]:
heart = pd.read_csv('data/heart.csv', sep=',')

Steps to tackle:
- Exploratory analysis of the dataset: number of rows, number of columns, column names, column types
- Missing data: which columns contain missing data? Are there corrupted rows? Which columns/rows do I have to remove? Is imputation needed?
- Handling categorical data: how many columns are ordinal? How many are nominal?
- Feature scaling
- Pipeline and ColumnTransformer implementation

*Hint*: A possible summary of the overall transformation may be given by the following figure:
![](figures/ColumnTransformerExercise.jpg)

In [15]:
heart.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       297 non-null    float64
 2   cp        302 non-null    float64
 3   trtbps    302 non-null    float64
 4   chol      298 non-null    float64
 5   fbs       293 non-null    float64
 6   restecg   302 non-null    float64
 7   thalachh  300 non-null    float64
 8   exng      302 non-null    float64
 9   oldpeak   302 non-null    float64
 10  slp       302 non-null    float64
 11  caa       302 non-null    float64
 12  thall     303 non-null    int64  
 13  zip       303 non-null    int64  
dtypes: float64(11), int64(3)
memory usage: 33.3 KB


In [18]:
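# thresh is the minimum number of non-missing values a row needs in order to be kept;
# here it is set to the largest number of missing values found in any single row
heart = heart.dropna(axis=0, thresh=heart.isnull().sum(axis=1).max())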

In [19]:
pip = Pipeline(
    [
        ('imp', SimpleImputer(strategy='mean')),
        ('stand',StandardScaler())
    ]
)

heart_ct = ColumnTransformer(
    transformers= [
        ('1',MinMaxScaler(),['age','caa','slp','thall']),
        ('2',pip, ['chol','fbs','thalachh']),
        ('3',SimpleImputer(strategy='most_frequent'),['sex']),
        ('4',OneHotEncoder(drop='first',categories='auto',handle_unknown='ignore'),['zip','restecg','cp']),
        ('5',StandardScaler(),['trtbps','oldpeak'])
    ],
    verbose_feature_names_out=False,
    remainder='passthrough'
)

heart_ct.fit(heart)

In [20]:
new_heart = pd.DataFrame(heart_ct.transform(heart), columns=heart_ct.get_feature_names_out())

In [22]:
new_heart.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 302 entries, 0 to 301
Data columns (total 20 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   age          302 non-null    float64
 1   caa          302 non-null    float64
 2   slp          302 non-null    float64
 3   thall        302 non-null    float64
 4   chol         302 non-null    float64
 5   fbs          302 non-null    float64
 6   thalachh     302 non-null    float64
 7   sex          302 non-null    float64
 8   zip_25100    302 non-null    float64
 9   zip_26026    302 non-null    float64
 10  zip_26030    302 non-null    float64
 11  zip_26100    302 non-null    float64
 12  restecg_1.0  302 non-null    float64
 13  restecg_2.0  302 non-null    float64
 14  cp_1.0       302 non-null    float64
 15  cp_2.0       302 non-null    float64
 16  cp_3.0       302 non-null    float64
 17  trtbps       302 non-null    float64
 18  oldpeak      302 non-null    float64
 19  exng    
