# Lab 4 - data set transformation pipelines

From the lab materials and (eventually :) ) the lecture, it is easy to see that the order in which operations are performed matters a great deal. With large data sets that require extensive transformations, keeping the code and the order of operations under control can be difficult.

Transformation pipelines from the **Scikit-learn** library address this problem.

Pipelines make it easy to keep code modular, that is, to split a task into smaller stages. They also help prevent information leaking between the training and test data: the transformation parameters are learned from the training data only, and the same fitted transformations are then applied to the test or validation data. Organising transformations this way saves the time otherwise spent making sense of large chunks of code and tracking down bugs later.
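
The leakage point can be sketched as follows. This is a minimal illustration, not part of the original lab: the array `X`, its size, and the 80/20 split are assumptions made up for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data (assumed for illustration): 100 rows, 3 numeric columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[::10, 1] = np.nan  # introduce a few missing values in the second column

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

pipeline = make_pipeline(SimpleImputer(strategy='mean'), MinMaxScaler())

# Column means and min/max ranges are learned from the training part only ...
X_train_prepared = pipeline.fit_transform(X_train)
# ... and are then reused, unchanged, on the test part.
X_test_prepared = pipeline.transform(X_test)
```

Because the scaler never sees the test rows during fitting, values in `X_test_prepared` may fall slightly outside the [0, 1] range; that is expected.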

## Using ready-made transformers

Using a transformation pipeline means creating an instance of the **Pipeline** class, whose initializer takes a list of data-processing steps given as tuples: (name, estimator). In most cases the classes shipped with **Scikit-learn** (e.g. *SimpleImputer*) are sufficient. Keep in mind that every step except the last must be an object that implements the *fit* and *transform* methods; the final step only needs *fit*.

In [24]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

In [25]:
num_values_pipeline = Pipeline([
    ('impute_missing_values', SimpleImputer(strategy='mean')),
    ('scale_values', MinMaxScaler()),
])
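
One use of the step names is looking up or reconfiguring a step later. The sketch below rebuilds the same pipeline so that it runs on its own; switching the strategy to `'median'` is only an illustrative choice.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

num_values_pipeline = Pipeline([
    ('impute_missing_values', SimpleImputer(strategy='mean')),
    ('scale_values', MinMaxScaler()),
])

# A step can be looked up by the name given in its tuple.
imputer = num_values_pipeline.named_steps['impute_missing_values']
strategy_before = imputer.strategy

# Parameters of a step are addressed as <step name>__<parameter name>.
num_values_pipeline.set_params(impute_missing_values__strategy='median')
strategy_after = imputer.strategy

print(strategy_before, strategy_after)
```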

An alternative is the *make_pipeline* function, which takes any number of estimators as arguments. Note that in this case the individual steps do not have to be named.

In [26]:
from sklearn.pipeline import make_pipeline

In [27]:
num_values_pipeline = make_pipeline(
    SimpleImputer(strategy='mean'),
    MinMaxScaler(),
)
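
The steps still get names; *make_pipeline* derives them from the lowercased class names. A short sketch to inspect them (the variable names are chosen only for this example):

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

pipe = make_pipeline(SimpleImputer(strategy='mean'), MinMaxScaler())

# Each entry of `steps` is a (name, estimator) tuple.
step_names = [name for name, _ in pipe.steps]
print(step_names)  # ['simpleimputer', 'minmaxscaler']

# When a class appears more than once, a numeric suffix keeps names unique.
doubled = make_pipeline(MinMaxScaler(), MinMaxScaler())
doubled_names = [name for name, _ in doubled.steps]
print(doubled_names)  # ['minmaxscaler-1', 'minmaxscaler-2']
```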

Applying a pipeline to a data set requires calling *fit* followed by *transform*, or calling *fit_transform*.
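
Both routes give the same result when the same data is used. A tiny sketch on a hand-made array (the values are assumptions chosen so the result is easy to check by hand):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],  # the missing value is replaced by the column mean, 2.0
    [3.0, 30.0],
])

pipe = make_pipeline(SimpleImputer(strategy='mean'), MinMaxScaler())

# Route 1: two separate calls.
pipe.fit(X)
two_calls = pipe.transform(X)

# Route 2: a single call.
one_call = pipe.fit_transform(X)

print(np.allclose(two_calls, one_call))
print(one_call)  # both columns become 0.0, 0.5, 1.0
```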

In [28]:
from sklearn.datasets import fetch_california_housing

In [29]:
data = fetch_california_housing(as_frame=True)['frame']

In [30]:
data

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422
...,...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09,0.781
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21,0.771
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22,0.923
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32,0.847


In [31]:
num_values_pipeline.fit_transform(data)

array([[0.53966842, 0.78431373, 0.0435123 , ..., 0.5674814 , 0.21115538,
        0.90226638],
       [0.53802706, 0.39215686, 0.03822395, ..., 0.565356  , 0.21215139,
        0.70824656],
       [0.46602805, 1.        , 0.05275646, ..., 0.5642933 , 0.21015936,
        0.69505074],
       ...,
       [0.08276438, 0.31372549, 0.03090386, ..., 0.73219979, 0.31175299,
        0.15938285],
       [0.09429525, 0.33333333, 0.03178269, ..., 0.73219979, 0.30179283,
        0.14371281],
       [0.13025338, 0.29411765, 0.03125246, ..., 0.72582359, 0.30976096,
        0.15340349]])

In [32]:
import pandas as pd

In [33]:
data_preprocessed = pd.DataFrame(
    num_values_pipeline.fit_transform(data),
    columns=num_values_pipeline.get_feature_names_out(),
    index=data.index,
)

In [34]:
data_preprocessed

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
0,0.539668,0.784314,0.043512,0.020469,0.008941,0.001499,0.567481,0.211155,0.902266
1,0.538027,0.392157,0.038224,0.018929,0.067210,0.001141,0.565356,0.212151,0.708247
2,0.466028,1.000000,0.052756,0.021940,0.013818,0.001698,0.564293,0.210159,0.695051
3,0.354699,1.000000,0.035241,0.021929,0.015555,0.001493,0.564293,0.209163,0.672783
4,0.230776,1.000000,0.038534,0.022166,0.015752,0.001198,0.564293,0.209163,0.674638
...,...,...,...,...,...,...,...,...,...
20635,0.073130,0.470588,0.029769,0.023715,0.023599,0.001503,0.737513,0.324701,0.130105
20636,0.141853,0.333333,0.037344,0.029124,0.009894,0.001956,0.738576,0.312749,0.128043
20637,0.082764,0.313725,0.030904,0.023323,0.028140,0.001314,0.732200,0.311753,0.159383
20638,0.094295,0.333333,0.031783,0.024859,0.020684,0.001152,0.732200,0.301793,0.143713


## Pipelines matched to attribute data types

For data sets whose attributes hold values of different types (e.g. numeric and categorical), a single pipeline that fills in missing values with the arithmetic mean can be problematic. The **ColumnTransformer** class solves this: each entry in its list consists of a name and an estimator, plus the list of attribute names the step is to be applied to.

In [35]:
from sklearn.datasets import fetch_kddcup99

In [41]:
data = fetch_kddcup99(as_frame=True)['frame']

In [42]:
data

,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,labels
0,0,b'tcp',b'http',b'SF',181,5450,0,0,0,0,...,9,1.0,0.0,0.11,0.0,0.0,0.0,0.0,0.0,b'normal.'
1,0,b'tcp',b'http',b'SF',239,486,0,0,0,0,...,19,1.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,b'normal.'
2,0,b'tcp',b'http',b'SF',235,1337,0,0,0,0,...,29,1.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,b'normal.'
3,0,b'tcp',b'http',b'SF',219,1337,0,0,0,0,...,39,1.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,b'normal.'
4,0,b'tcp',b'http',b'SF',217,2032,0,0,0,0,...,49,1.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,b'normal.'
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
494016,0,b'tcp',b'http',b'SF',310,1881,0,0,0,0,...,255,1.0,0.0,0.01,0.05,0.0,0.01,0.0,0.0,b'normal.'
494017,0,b'tcp',b'http',b'SF',282,2286,0,0,0,0,...,255,1.0,0.0,0.17,0.05,0.0,0.01,0.0,0.0,b'normal.'
494018,0,b'tcp',b'http',b'SF',203,1200,0,0,0,0,...,255,1.0,0.0,0.06,0.05,0.06,0.01,0.0,0.0,b'normal.'
494019,0,b'tcp',b'http',b'SF',291,1200,0,0,0,0,...,255,1.0,0.0,0.04,0.05,0.04,0.01,0.0,0.0,b'normal.'


In [43]:
data.dtypes

duration                       object
protocol_type                  object
service                        object
flag                           object
src_bytes                      object
dst_bytes                      object
land                           object
wrong_fragment                 object
urgent                         object
hot                            object
num_failed_logins              object
logged_in                      object
num_compromised                object
root_shell                     object
su_attempted                   object
num_root                       object
num_file_creations             object
num_shells                     object
num_access_files               object
num_outbound_cmds              object
is_host_login                  object
is_guest_login                 object
count                          object
srv_count                      object
serror_rate                    object
srv_serror_rate                object
rerror_rate 

In [44]:
# These attributes are loaded with dtype object; cast them to integers.
data[['duration', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment', 'urgent']] = data[['duration', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment', 'urgent']].astype(int)

In [45]:
# Decode byte strings to text; DataFrame.map needs pandas >= 2.1 (older versions: applymap).
data[['protocol_type', 'service', 'flag', 'labels']] = data[['protocol_type', 'service', 'flag', 'labels']].map(lambda x: x.decode('utf-8'))

In [46]:
data.dtypes

duration                        int32
protocol_type                  object
service                        object
flag                           object
src_bytes                       int32
dst_bytes                       int32
land                            int32
wrong_fragment                  int32
urgent                          int32
hot                            object
num_failed_logins              object
logged_in                      object
num_compromised                object
root_shell                     object
su_attempted                   object
num_root                       object
num_file_creations             object
num_shells                     object
num_access_files               object
num_outbound_cmds              object
is_host_login                  object
is_guest_login                 object
count                          object
srv_count                      object
serror_rate                    object
srv_serror_rate                object
rerror_rate 

In [40]:
data

,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,labels
0,0,tcp,http,SF,181,5450,0,0,0,0,...,9,1.0,0.0,0.11,0.0,0.0,0.0,0.0,0.0,normal.
1,0,tcp,http,SF,239,486,0,0,0,0,...,19,1.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,normal.
2,0,tcp,http,SF,235,1337,0,0,0,0,...,29,1.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,normal.
3,0,tcp,http,SF,219,1337,0,0,0,0,...,39,1.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,normal.
4,0,tcp,http,SF,217,2032,0,0,0,0,...,49,1.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,normal.
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
494016,0,tcp,http,SF,310,1881,0,0,0,0,...,255,1.0,0.0,0.01,0.05,0.0,0.01,0.0,0.0,normal.
494017,0,tcp,http,SF,282,2286,0,0,0,0,...,255,1.0,0.0,0.17,0.05,0.0,0.01,0.0,0.0,normal.
494018,0,tcp,http,SF,203,1200,0,0,0,0,...,255,1.0,0.0,0.06,0.05,0.06,0.01,0.0,0.0,normal.
494019,0,tcp,http,SF,291,1200,0,0,0,0,...,255,1.0,0.0,0.04,0.05,0.04,0.01,0.0,0.0,normal.


In [47]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

In [48]:
cat_values_pipeline = make_pipeline(
    OrdinalEncoder(handle_unknown='error'),
)

In [49]:
preprocessing_pipeline = ColumnTransformer([
    ('num_attributes_steps', num_values_pipeline, data.select_dtypes('number').columns),
    ('cat_attributes_steps', cat_values_pipeline, ('protocol_type', 'service', 'flag', 'labels')),
])

Alternatively, just as with *make_pipeline*, the *make_column_transformer* function makes it possible to omit the step names. A useful companion is the *make_column_selector* function, which selects attributes of a given type.

In [50]:
from sklearn.compose import make_column_selector, make_column_transformer

In [51]:
preprocessing_pipeline = make_column_transformer(
    (num_values_pipeline, make_column_selector(dtype_include='number')),
    (cat_values_pipeline, ('protocol_type', 'service', 'flag', 'labels')),
)
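
A selector is a callable that takes a DataFrame and returns the matching column names, so it can be tried on its own. A minimal sketch on a made-up three-column frame (the data is an assumption; only the column names echo the data set above):

```python
import pandas as pd
from sklearn.compose import make_column_selector

df = pd.DataFrame({
    'duration': [0, 3],
    'protocol_type': ['tcp', 'udp'],
    'src_bytes': [181, 239],
})

# Columns with a numeric dtype.
numeric_columns = make_column_selector(dtype_include='number')(df)
# Everything that is not numeric.
other_columns = make_column_selector(dtype_exclude='number')(df)

print(numeric_columns)  # ['duration', 'src_bytes']
print(other_columns)    # ['protocol_type']
```

Inside a column transformer the selector is evaluated when `fit` is called, so the column list always reflects the data actually passed in.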

In [52]:
data_preprocessed = pd.DataFrame(
    preprocessing_pipeline.fit_transform(data),
    columns=preprocessing_pipeline.get_feature_names_out(),
    index=data.index,
)

In [53]:
data_preprocessed

,pipeline-1__duration,pipeline-1__src_bytes,pipeline-1__dst_bytes,pipeline-1__land,pipeline-1__wrong_fragment,pipeline-1__urgent,pipeline-2__protocol_type,pipeline-2__service,pipeline-2__flag,pipeline-2__labels
0,0.0,2.610418e-07,0.001057,0.0,0.0,0.0,1.0,22.0,9.0,11.0
1,0.0,3.446905e-07,0.000094,0.0,0.0,0.0,1.0,22.0,9.0,11.0
2,0.0,3.389216e-07,0.000259,0.0,0.0,0.0,1.0,22.0,9.0,11.0
3,0.0,3.158461e-07,0.000259,0.0,0.0,0.0,1.0,22.0,9.0,11.0
4,0.0,3.129617e-07,0.000394,0.0,0.0,0.0,1.0,22.0,9.0,11.0
...,...,...,...,...,...,...,...,...,...,...
494016,0.0,4.470881e-07,0.000365,0.0,0.0,0.0,1.0,22.0,9.0,11.0
494017,0.0,4.067060e-07,0.000443,0.0,0.0,0.0,1.0,22.0,9.0,11.0
494018,0.0,2.927706e-07,0.000233,0.0,0.0,0.0,1.0,22.0,9.0,11.0
494019,0.0,4.196859e-07,0.000233,0.0,0.0,0.0,1.0,22.0,9.0,11.0


## Tasks

1. Refactor the tasks from labs 2 and 3 so that they are implemented entirely with pipelines.