# IAU projekt

> Rovnakým podieľom práce vypracovali: <br>
> Roman Bitarovský, Emma Macháčová

## Table of contents <a name="obsah"></a>
* [Zadanie](#zadanie)
    * [Slovníček](#slovnicek)
   
* [Data init (Fáza 1)](#dataInit)
* [Fáza 2](#faza2)
    * [2.1. Integrácia a čistenie dát](#2.1.)
        * [2.1.1. Replacing NaNs](#2.1.1.)
            * [2.1.1.1. Replacing NaNs - Method 1: Drop nans](#2.1.1.a)
            * [2.1.1.2. Replacing NaNs - Method 2: Replace with Mean](#2.1.1.b)
            * [2.1.1.3. Replacing NaNs - Method 3: Replace with Median](#2.1.1.c)
            * [2.1.1.4. Replacing NaNs - Method 4: Replace with kNN](#2.1.1.d)
        * [2.1.2. Deleting Outliers Values](#2.1.2.)
    * [2.2. Realizácia predspracovania dát](#2.2.)  
        * [2.2.1. Transforovanie a škálovanie dát](#2.2.1.)
        * [2.2.2. Rozdelenie dát](#2.2.2.)
        * [2.2.3. Zhodnotenie ](#2.2.3.)
    * [2.3. Výber atribútov pre strojové učenie](#2.3.)  
        * [2.3.1. Variance Threshold ](#2.3.1.)
        * [2.3.2. SelectKBest](#2.3.2.)
        * [2.3.3 SelectPercentile](#2.3.3.)
        * [2.3.4. Záver výberov](#2.3.4.)
    * [2.4. Replikovateľnosť predspracovania](#2.4.)  
        * [2.4.1. Code improvements](#2.4.1.)
        * [2.4.2. Pipeline](#2.4.2.)

# Zadanie <a name="zadanie"></a>
Znečistenie ovzdušia spôsobuje vážne dýchacie a srdcové ochorenia, ktoré môžu byť smrteľné. Najčastejšie sú postihnuté deti, čo vedie k zápalu pľúc a problémom s dýchaním vrátane astmy. Kyslé dažde, ničenie ozónovej vrstvy a globálne otepľovanie sú niektoré z nepriaznivých dôsledkov. Dátová sada pre Vás (World's Air Pollution: Real-time Air Quality Index https://waqi.info/) predstavuje záznamy jednotlivých meraní kvality ovzdušia ako kombinácia mnohých faktorov bez časovej následnosti. V záznamoch je závislá premenná s menom “warning” indikujúca alarmujúci stav kvality ovzdušia. Vo veľkých mestách ako napr. Peking (angl. Beijing, hlavné mesto Číny s viac ako 21 miliónov ľudí) sa pri varovaní spustí opatrenie ako obmedzenie pohybov áut a ľudí v meste alebo umelý dážď až pokiaľ kvalita vzduchu sa nevráti do normu.

* Úlohou je predikovať závislé hodnoty premennej “warning” pomocou metód strojového učenia.
* Pritom sa treba vysporiadať s viacerými problémami, ktoré sa v dátach nachádzajú ako formáty dát, chýbajúce, vychýlené hodnoty a pod.

## Slovníček  <a name="slovnicek"></a>
<details>
    <summary>Zobraziť</summary>
    
    PM2.5 - Particulate Matter (µg/m3) 
    PM10 - Particulate Matter (µg/m3) 
    NOx - Nitrogen Oxides (µg/m3)
    NO2 - Nitrogen Dioxide (µg/m3)
    SO2 - Sulfur Dioxide  (µg/m3)
    CO - Carbon Monoxide emissions  (µg/m3)
    CO2 - Carbon Dioxide  (µg/m3)
    PAHs - Polycyclic Aromatic Hydrocarbons  (µg/m3)
    NH3 - Ammonia trace  (µg/m3)
    Pb - Lead  (µg/m3)
    TEMP - Temperature (degree Celsius)
    DEWP - Dew point temperature (degree Celsius)
    PRES - Pressure (hPa, <100, 1050>)
    RAIN - Rain (mm)
    WSPM - Wind Speed (m/s)
    WD - Wind Direction
    VOC - Volatile Organic Compounds
    CFCs - Chlorofluorocarbons
    C2H3NO5 - Peroxyacetyl nitrate
    H2CO - Plywood emit formaldehyde
    GSTM1 - Glutathione-S transferase M1
    1-OHP - 1-hydroxypyrene
    2-OHF - 2-hydroxyfluorene
    2-OHNa - 2-hydroxynaphthalene
    N2 - Nitrogen
    O2 - Oxygen
    O3 - Ozone
    Ar - Argon
    Ne - Neon
    CH4 - Methane
    He - Helium
    Kr - Krypton
    I2 - Iodine
    H2 - Hydrogen
    Xe - Xenon
</details>

# Data init <a name="dataInit"></a>

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm
import statsmodels.stats.api as sms
import statsmodels.stats as sm_stats
from scipy.stats import mannwhitneyu
from scipy.stats import f_oneway
import datetime
import re
import category_encoders as ce
from sklearn.impute import SimpleImputer, KNNImputer
from numpy import percentile
from sklearn.preprocessing import PowerTransformer, QuantileTransformer
from sklearn.feature_selection import VarianceThreshold, SelectKBest, SelectPercentile, SelectFromModel
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import mutual_info_regression, chi2, f_regression, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

In [2]:
labor_measurements = pd.read_csv('../081/measurements.csv', sep='\t')
labor_stations = pd.read_csv('../081/stations.csv', sep='\t')

Úprava dát rovnaká ako vo fáze 1, hlavne teda merge tabuliek.

In [3]:
labor_stations["QoS"].replace({"acceptable": "accep", "maitennce": "maintenance"}, inplace=True)
labor_stations['revision'] = pd.to_datetime(labor_stations['revision'], utc=False)

labor_measurements.replace('', np.nan, inplace=True)
labor_measurements.replace(r'^\s*$', np.nan, regex=True)
labor_stations.replace('', np.nan, inplace=True)
labor_stations.replace(r'^\s*$', np.nan, regex=True)

labor_measurements = labor_measurements.drop_duplicates()
labor_stations = labor_stations.drop_duplicates()

# merge preprocesing
labor_stations = labor_stations.drop(columns=['revision', 'code', 'QoS'])
labor_stations = labor_stations.drop_duplicates()

# Table merge
df = pd.merge(labor_measurements, labor_stations, how='inner', left_on=['latitude', 'longitude'], right_on=['latitude', 'longitude'])

df = df.drop(columns=['latitude', 'longitude'])
df = df[['location', 'warning', 'TEMP', 'PRES', 'PM2.5', 'NOx', 'PM10', 'C2H3NO5', 'CH4', 'Pb', 'NH3', 'SO2', 'O3', 'CO', 'PAHs', 'H2CO', 'CFCs']]

df.head()

Unnamed: 0,location,warning,TEMP,PRES,PM2.5,NOx,PM10,C2H3NO5,CH4,Pb,NH3,SO2,O3,CO,PAHs,H2CO,CFCs
0,America/Los_Angeles,0.0,20.05101,1139.12673,8.47714,9.21522,9.38738,1.51791,7.84989,59.51096,10.43604,5.81201,7.77502,9.69678,8.6209,47.6481,74.87342
1,America/Los_Angeles,1.0,21.55701,1115.19699,7.3688,9.66741,8.19826,0.64236,8.48027,54.0398,9.62838,7.97135,9.72566,5.83821,8.28391,64.99154,63.42154
2,America/Los_Angeles,1.0,3.06998,1086.02547,9.81855,9.66138,6.16989,0.23616,8.49506,47.32216,6.38848,6.14333,9.73098,7.3773,5.98279,43.12537,71.61779
3,America/Los_Angeles,1.0,10.04558,1168.0234,8.7647,10.27526,7.1013,0.1708,7.35744,48.49527,8.11869,6.74522,9.6333,4.8981,8.76285,43.67037,64.6402
4,America/Los_Angeles,1.0,24.88676,1061.95581,6.7671,9.95663,8.35517,0.75765,6.98671,52.91472,8.87397,9.24788,8.40595,10.82485,7.88543,40.39068,70.4639


Prekodovanie textu lokácie na číselné hodnoty pre umožnenie spracovania ML. 

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11939 entries, 0 to 11938
Data columns (total 17 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   location  11939 non-null  object 
 2   TEMP      11891 non-null  float64
 3   PRES      11939 non-null  float64
 4   PM2.5     11891 non-null  float64
 5   NOx       11891 non-null  float64
 6   PM10      11891 non-null  float64
 7   C2H3NO5   11891 non-null  float64
 8   CH4       11891 non-null  float64
 9   Pb        11891 non-null  float64
 10  NH3       11891 non-null  float64
 11  SO2       11891 non-null  float64
 12  O3        11891 non-null  float64
 13  CO        11891 non-null  float64
 14  PAHs      11891 non-null  float64
 15  H2CO      11891 non-null  float64
 16  CFCs      11891 non-null  float64
dtypes: float64(16), object(1)
memory usage: 1.6+ MB


# Fáza 2 - Pipeline <a name="faza2"></a> 

In [5]:
df_not_changed = df.copy() # zachovanie originálneho df pre potencionálne pororvnávanie

### Utils

In [6]:
def count_columns(df):
    return df.columns[df.isnull().any()].tolist()

In [7]:
def draw(df):
    fig, ax = plt.subplots(figsize=(16,8))
    corr_diff = df.corr() - df_original.corr()
    sns.heatmap(corr_diff[abs(corr_diff) > 0.000099], ax=ax, annot=True, fmt=".4f")
    pass

In [8]:
def df_columns(df):
    new_cols = []
    
    for col in df.columns:
        if col not in ['location', 'warning']:
            new_cols.append(col)
        
    print(new_cols)
    return new_cols

### Handle NaNs

In [9]:
class HandleNaNs_drop(TransformerMixin):
    
    def __init__(self):
        pass
    
    def replaceNaN(self, df):
        df = df.dropna().reset_index()
        print(df.isnull().sum())
        return df
    
    def fit(self, X):
        return self
    
    def transform(self, X):
        return self.replaceNaN(X)

In [10]:
class HandleNaNs_mean(TransformerMixin):
    
    def __init__(self):
        pass
    
    def replaceNaN(self, df):
        na_cols = df_columns(df)
        imp_strategy = SimpleImputer(missing_values=np.nan, strategy='mean')
        
        for col in na_cols:
            df[col] = imp_strategy.fit_transform(df[[col]])

        df = df.dropna().reset_index()
        print(df.isnull().sum())

        return df
    
    def fit(self, X):
        return self
    
    def transform(self, X):
        return self.replaceNaN(X)

In [11]:
class HandleNaNs_median(TransformerMixin):
    
    def __init__(self):
        pass
    
    def replaceNaN(self, df):
        na_cols = df_columns(df)
        imp_strategy = SimpleImputer(missing_values=np.nan, strategy='median')
        
        for col in na_cols:
            df[col] = imp_strategy.fit_transform(df[[col]])

        df = df.dropna().reset_index()
        print(df.isnull().sum())

        return df
    
    def fit(self, X):
        return self
    
    def transform(self, X):
        return self.replaceNaN(X)

In [12]:
class HandleNaNs_knn(TransformerMixin):
    
    def __init__(self):
        pass
    
    def replaceNaN(self, df):
        na_cols = df_columns(df)
        imp_strategy = KNNImputer(n_neighbors=5, weights='uniform', metric='nan_euclidean')
        
        for col in na_cols:
            df[col] = imp_strategy.fit_transform(df[[col]])

        df = df.dropna().reset_index()
        print(df.isnull().sum())

        return df
    
    def fit(self, X):
        return self
    
    def transform(self, X):
        return self.replaceNaN(X)

### Handle non numeric atributes

In [13]:
class HandleLocation(TransformerMixin):
    
    def __init__(self):
        pass

    def encodeLocation(self, df):
        # prekodovanie textu locacie n číslo 
        ce_ordinal = ce.OrdinalEncoder(cols=['location'])
        return ce_ordinal.fit_transform(df)
        
    def fit(self, X):
        return self
    
    def transform(self, X):
        return self.encodeLocation(X)

### Handle Outliers

In [14]:
class HandleOutliers_replace(TransformerMixin):
    
    def __init__(self):
        pass
       
    def handleOutliers(self, df):
        
        for col in df_columns(df):  
            
            q05 = percentile(df[col], 5)
            q95 = percentile(df[col], 95)

            df[col] = np.where(df[col] < q05, q05, df[col])
            df[col] = np.where(df[col] > q95, q95, df[col])
            
        return df
         
    def fit(self, X):
        return self
    
    def transform(self, X):
        return self.handleOutliers(X)

### Handle Transformations

In [15]:
class HandleTransformations_power(TransformerMixin):

    def __init__(self):
        pass
    
    def fit(self, X):
        return self
    
    def transform(self, X):
        df = X
        power = PowerTransformer(method='yeo-johnson', standardize=True)
        new_df = pd.DataFrame(power.fit_transform(df), columns = df.columns)
        new_df['location'] = df['location']
        new_df['warning'] = df['warning']
        return new_df

In [16]:
class HandleTransformations_quant(TransformerMixin):

    def __init__(self):
        pass
    
    def fit(self, X):
        return self
    
    def transform(self, X):
        df = X
        quan = QuantileTransformer(n_quantiles=10, random_state=0)
        new_df = pd.DataFrame(quan.fit_transform(df), columns = df.columns)
        new_df['location'] = df['location']
        new_df['warning'] = df['warning']
        return new_df

In [17]:
class HandleTransformations_scaleMM(TransformerMixin):

    def __init__(self):
        pass
    
    def fit(self, X):
        return self
    
    def transform(self, X):
        df = X
        norm_s = MinMaxScaler()
        new_df = pd.DataFrame(norm_s.fit_transform(df), columns = df.columns)
        new_df['location'] = df['location']
        new_df['warning'] = df['warning']
        return new_df

In [18]:
class HandleTransformations_scaleS(TransformerMixin):

    def __init__(self):
        pass
    
    def fit(self, X):
        return self
    
    def transform(self, X):
        df = X
        stan_s = StandardScaler()
        new_df = pd.DataFrame(stan_s.fit_transform(df), columns = df.columns)
        new_df['location'] = df['location']
        new_df['warning'] = df['warning']
        return new_df

### Split train and test

In [19]:
class Split(TransformerMixin):
    
    def __init__(self):
        pass
    
    def fit(self, X):
        return self
    
    def transform(self, X):  
        df = X
        X_train, X_test, y_train, y_test = train_test_split(df.drop(['warning'], axis=1), df['warning'], test_size=0.33)
        return X_train, X_test, y_train, y_test

### Handle Selection

In [20]:
class VarianceThreshold_do(TransformerMixin):
    
    def __init__(self):
        pass
        
    def fit(self, X):
        return self
    
    def transform(self, X):
        df = X.copy()

        sel = VarianceThreshold(.8 * (1 - .8))
        colsVT = sel.fit_transform(df)
                        
        if (df.shape[1] == colsVT[0].size):
            print('VarianceThreshold: Všetky dáta sú užitočné')
            
        elif (colsVT[0].size < df.shape[1]):
            print('VarianceThreshold: Máme aj neužitočné dáta')
        
        return X


In [21]:
class Selection_KBest_mutual_info_regression(TransformerMixin):
    
    def __init__(self):
        pass
    
    def orderColumns(self, tuple_of_df):
        X_train, X_test, y_train, y_test = tuple_of_df[0], tuple_of_df[1], tuple_of_df[2], tuple_of_df[3]
                
        selector = SelectKBest(mutual_info_regression, k='all')
        
        selected = selector.fit_transform(X_train, y_train)
        
        scores = selector.scores_
        
        col_names = X_train.columns[selector.get_support()]

        indices = []
        for _, x in sorted(zip(scores, col_names), reverse=True):
            indices.append(x)

        X_train.columns = indices
        
        return X_train, X_test, y_train, y_test
        
    def fit(self, tuple_of_df):
        return self
    
    def transform(self, tuple_of_df):
        return self.orderColumns(tuple_of_df)


In [22]:
class Selection_KBest_f_regression(TransformerMixin):
    
    def __init__(self):
        pass
    
    def orderColumns(self, tuple_of_df):
        X_train, X_test, y_train, y_test = tuple_of_df[0], tuple_of_df[1], tuple_of_df[2], tuple_of_df[3]
                
        selector = SelectKBest(f_regression, k='all')
        
        selected = selector.fit_transform(X_train, y_train)
        
        scores = selector.scores_
        
        col_names = X_train.columns[selector.get_support()]

        indices = []
        for _, x in sorted(zip(scores, col_names), reverse=True):
            indices.append(x)

        X_train.columns = indices
        
        return X_train, X_test, y_train, y_test
        
    def fit(self, tuple_of_df):
        return self
    
    def transform(self, tuple_of_df):
        return self.orderColumns(tuple_of_df)


In [23]:
class Selection_Percentile_f_classif(TransformerMixin):
    
    def __init__(self):
        pass
    
    def orderColumns(self, tuple_of_df):
        X_train, X_test, y_train, y_test = tuple_of_df[0], tuple_of_df[1], tuple_of_df[2], tuple_of_df[3]
                
        selector = SelectPercentile(f_classif, percentile=100)
        
        selected = selector.fit_transform(X_train, y_train)
        
        scores = selector.scores_
        
        col_names = X_train.columns[selector.get_support()]

        indices = []
        for _, x in sorted(zip(scores, col_names), reverse=True):
            indices.append(x)

        X_train.columns = indices
        
        return X_train, X_test, y_train, y_test
        
    def fit(self, tuple_of_df):
        return self
    
    def transform(self, tuple_of_df):
        return self.orderColumns(tuple_of_df)


In [24]:
class Selection_Percentile_f_regression(TransformerMixin):
    
    def __init__(self):
        pass
    
    def orderColumns(self, tuple_of_df):
        X_train, X_test, y_train, y_test = tuple_of_df[0], tuple_of_df[1], tuple_of_df[2], tuple_of_df[3]
                
        selector = SelectPercentile(f_regression, percentile=100)
        
        selected = selector.fit_transform(X_train, y_train)
        
        scores = selector.scores_
        
        col_names = X_train.columns[selector.get_support()]

        indices = []
        for _, x in sorted(zip(scores, col_names), reverse=True):
            indices.append(x)

        X_train.columns = indices
        
        return X_train, X_test, y_train, y_test
        
    def fit(self, tuple_of_df):
        return self
    
    def transform(self, tuple_of_df):
        return self.orderColumns(tuple_of_df)


## 2.4.2. Pipeline <a name="2.4.2."></a>

### Pipeline č. 1

In [25]:
def pipelineGenerator():
    
    pipeline =  Pipeline([
        ('HandleNaNs', HandleNaNs_drop()),
        ('HandleLocation', HandleLocation()),
        ('HandleOutliers', HandleOutliers_replace()),
        ('HandleTransformations', HandleTransformations_power()),
        ('HandleSelection', VarianceThreshold_do()),
        ('Split', Split()),
        ('handleSelection2', Selection_Percentile_f_regression()),
        
    ])
    return pipeline

In [26]:
pipeline1 = pipelineGenerator()
X_train, X_test, y_train, y_test = pipeline1.fit_transform(df_not_changed.copy())

index       0
location    0
TEMP        0
PRES        0
PM2.5       0
NOx         0
PM10        0
C2H3NO5     0
CH4         0
Pb          0
NH3         0
SO2         0
O3          0
CO          0
PAHs        0
H2CO        0
CFCs        0
dtype: int64
['index', 'TEMP', 'PRES', 'PM2.5', 'NOx', 'PM10', 'C2H3NO5', 'CH4', 'Pb', 'NH3', 'SO2', 'O3', 'CO', 'PAHs', 'H2CO', 'CFCs']
VarianceThreshold: Všetky dáta sú užitočné


### Pipeline č. 2

## Export do CSV

In [27]:
def toFiles(X_train, X_test, y_train, y_test):
    X_train.to_csv('X_train.csv', sep=';')
    X_test.to_csv('X_test.csv', sep=';')
    y_train.to_csv('y_train.csv', sep=';')
    y_test.to_csv('y_test.csv', sep=';')

In [28]:
# toFiles(X_train, X_test, y_train, y_test)

In [29]:
print(len(X_train))
print(len(y_train))
print(len(X_test))
print(len(y_test))

7529
7529
3709
3709


# Fáza 3 - Strojové učenie 

In [30]:
def print_results(predicted_train, predicted_test, y_train, y_test):
    
    print("Predicting for train dataset:")
    print(classification_report(y_train, predicted_train))

    print("Predicting for test dataset:")
    print(classification_report(y_test, predicted_test))
    
    pass

## Jednoduchý klasifikátor na základe závislosti v dátach (5b)
* Naimplementujte OneR algorithm (iné mená: OneRule or 1R), ktorý je jednoduchý klasifikátor tzv. rozhodnutie na základe jedného atribútu. Môžete implementovať aj komplikovanejšie t.j. rozhodnutie na základe kombinácie atribútov.
* Algoritmus by mal byť realizovaný na základe závislostí v dátach. Vyhodnoťte klasifikátora pomocou metrík accuracy, precision a recall.

## Trénovanie a vyhodnotenie klasifikátorov strojového učenia 
* Na trénovanie využite minimálne jeden stromový algoritmus strojového učenia v scikit-learn.
* Vizualizujte natrénované pravidlá.
* Vyhodnoťte natrénované modely pomocou metrík accuracy, precision a recall
* Porovnajte ašpoň jeden natrénovaný klasifikátor v scikit-learn s jednoduchým klasifikátorom z prvého kroku.

In [31]:
### 3.2.1 Decision tree

In [32]:
def decisionTreeDriver(X_train, X_test, y_train, y_test):
    
    cls = DecisionTreeClassifier(max_depth=None, random_state=1)
    cls.fit(X_train, y_train)
    
    predicted_train = cls.predict(X_train)
    predicted_test = cls.predict(X_test)
    
    print_results(predicted_train, predicted_test, y_train, y_test)
    
    return cls, predicted_train, predicted_test

In [34]:
decisionTreeDriver(X_train, X_test, y_train, y_test)

NameError: name 'DecisionTreeClassifier' is not defined