# Python in Data Science Workshops

---
## Feature Engineering - part 1 of 2  

- ### Operations on simple values
  - #### Binarization
  - #### Binning ("*bucketing*")

- ### Laplace smoothing

- ### Scaling values
  - #### Logarithmic
  - #### Min-Max scaling
  - #### *Robust scaling*
  - #### Standardization

- ### Categorical variables
  - #### Indexing
  - #### *One-hot encoding*
  - #### Order in categorical variables

- ### Feature selection

---

> ## The features you use influence more than everything else the result. 
> ## No algorithm alone, to my knowledge, can supplement the information gain given by correct feature engineering.
> ## <div style="text-align: right">— Luca Massaron, author, Kaggle master</div>

---

> ## Coming up with features is difficult, time-consuming, requires expert knowledge.
> ## "_*Applied machine learning*_" is basically feature engineering.
> ## <div style="text-align: right">— Andrew Ng</div>

---

# Simple values<a id="simple"></a>

## Motivation for binarization

In [None]:
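import pickle
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from pandas.plotting import register_matplotlib_converters

# Let matplotlib handle pandas datetime values on the axes
register_matplotlib_converters()

# Load the pickled counts; the context manager closes the file afterwards
with open("data/count.p", "rb") as f:
    count_df = pickle.load(f)

# Set the style before creating the figure so that it applies to the whole figure
plt.style.use("dark_background")
plt.figure(figsize=(24, 12))

chart = sns.scatterplot(data=count_df)

# Write the (unchanged) frame back to disk
with open("data/count.p", "wb") as f:
    pickle.dump(count_df, f)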
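
When raw counts span a very wide range, often the only signal worth keeping is whether an event happened at all. The sketch below illustrates binarization on a small, made-up array of counts; the values and the `threshold` name are assumptions for illustration and are not taken from `data/count.p`.

```python
import numpy as np

# Hypothetical per-user event counts; the values are made up for illustration.
counts = np.array([0, 1, 3, 0, 250, 12, 0, 7])

# Binarize: 1 if the event happened at least once, 0 otherwise.
threshold = 0
binary = (counts > threshold).astype(int)

print(binary)
```

After binarization a count of 250 and a count of 1 become the same value, which may be desirable when a few very large counts would otherwise dominate the feature.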



---
## Binning ("*bucketing*")


### Fixed-width histogram

In [None]:
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': np.random.randint(0, 100, 20)})
df

In [None]:
df.shape

In [None]:
pd.cut(df.value, range(0, 105, 10), right=False)

In [None]:
labels = ["{0} - {1}".format(i, i + 9) for i in range(0, 100, 10)]
labels

In [None]:
pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)

In [None]:
df['Group'] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)
df

In [None]:
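# observed=False keeps empty bins in the result (and is explicit about it on newer pandas)
df.groupby('Group', observed=False).count()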

In [None]:
# Fresh random sample, this time for quantile-based (equal-frequency) binning
df = pd.DataFrame({'value': np.random.randint(0, 100, 20)})
df

In [None]:
# Quartile boundaries of the sample
df['value'].quantile([.25, .5, .75])

In [None]:
pd.qcut(df['value'], q=4)

In [None]:
df['quartile']=pd.qcut(df['value'], q=4)
df

In [None]:
df['quartile']=pd.qcut(df['value'], q=4, labels=range(1,5))
df

---
# Scaling <a id="scale"></a>

- ## Logarithmic

In [None]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

In [None]:
data = pd.read_csv('data/adverts_29_04.csv', sep=';')
data

In [None]:
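data['Price per m2'] = data['Price'] / data['Size (m2)']
# Drop rows without a price per m2; copy so later assignments do not write into a view
data = data.dropna(subset=['Price per m2']).copy()
# Round to whole units but keep the column numeric (string formatting would break scaling)
data['Price per m2'] = data['Price per m2'].round(0)
# Split the date string into its parts using fixed character positions
data["day"] = data['Date'].str[0:2]
data["month"] = data['Date'].str[3:5]
data["year"] = data['Date'].str[6:]
df = data.drop(['Price', 'Date'], axis=1)
df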

In [None]:
import pandas as pd
from numpy import log2

data = pd.read_csv('data/adverts_29_04.csv', sep=';')
data['Price per m2'] = data['Price'] / data['Size (m2)']
# Drop missing values while the column is still numeric, then round
data = data.dropna(subset=['Price per m2']).copy()
data['Price per m2'] = data['Price per m2'].round(0)
# Base-2 logarithm of the price (the original cell took the log of the size by mistake)
data["Price log"] = log2(data['Price']).round(2)
data["day"] = data['Date'].str[0:2]
data["month"] = data['Date'].str[3:5]
data["year"] = data['Date'].str[6:]
df = data.drop(['Price', 'Date'], axis=1)
df
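
To see why a logarithm helps with values that span several orders of magnitude, here is a minimal sketch on made-up prices (the numbers are assumptions, not rows from the CSV file). It compares the ratio between the largest and the smallest value before and after applying `np.log2`.

```python
import numpy as np

# Hypothetical prices spanning several orders of magnitude (not from the CSV).
prices = np.array([1_000, 10_000, 100_000, 1_000_000], dtype=float)

log_prices = np.log2(prices)

# Ratio between the largest and the smallest value, before and after the transform.
raw_ratio = prices.max() / prices.min()
log_ratio = log_prices.max() / log_prices.min()

print(raw_ratio, round(log_ratio, 2))
```

The transform turns multiplicative differences into additive ones, so equal price ratios become equal distances on the log scale.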

---
- ## Min-Max Scaling

#### Scales and shifts the data so that it fits between `0` and `1`

### $$x_{minmax}^i = \frac{x^i-\min(x)}{\max(x)-\min(x)}$$



In [None]:
df[["Price per m2", "Size (m2)"]]

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

scaler.fit(df[["Price per m2", "Size (m2)"]])

In [None]:
scaler.data_max_

In [None]:
pd.DataFrame(scaler.transform(df[["Price per m2", "Size (m2)"]]))

In [None]:
pd.DataFrame(scaler.fit_transform(df[["Price per m2", "Size (m2)"]]))


- ## Robust scaling
#### Similar to min-max scaling, but it subtracts the median and divides by the distance between the 1st and 3rd quartiles (the interquartile range)


In [None]:
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()

pd.DataFrame(scaler.fit_transform(df[["Price per m2", "Size (m2)"]]))
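
The description of robust scaling can be written out by hand with NumPy. The sketch below uses a made-up feature with a single extreme outlier and assumes the default quartile range (25th to 75th percentile); it also computes min-max scaling on the same data for comparison.

```python
import numpy as np

# Hypothetical feature with one extreme outlier.
x = np.array([10.0, 12.0, 14.0, 16.0, 18.0, 1000.0])

# Robust scaling: subtract the median, divide by the interquartile range.
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
robust = (x - median) / (q3 - q1)

# Min-max scaling on the same data, for comparison.
min_max = (x - x.min()) / (x.max() - x.min())

print(np.round(robust, 2))
print(np.round(min_max, 4))
```

With min-max scaling the outlier squeezes all ordinary values into a tiny interval near `0`, while robust scaling keeps them spread out because the median and quartiles barely react to a single extreme value.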


- ## Standardization

#### Standardization subtracts the mean and divides by the standard deviation, so the feature ends up with mean `0` and standard deviation `1`; a normally distributed feature becomes standard normal.


In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

pd.DataFrame(scaler.fit_transform(df[["Price per m2", "Size (m2)"]]))
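
As a hand-written counterpart, the sketch below standardizes a small made-up sample with plain NumPy. It assumes the population standard deviation (`ddof=0`), which is NumPy's default for `std`.

```python
import numpy as np

# Hypothetical sample with mean 5 and population standard deviation 2.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# z-score: subtract the mean, divide by the (population) standard deviation.
z = (x - x.mean()) / x.std()

print(z)
print(z.mean(), z.std())
```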

---

# WARNING

##  Do <span style="color: red">NOT</span> turn <span style="color: cyan">SPARSE (many zeros)</span> data into <span style="color: cyan">DENSE (few zeros)</span> data
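
A minimal sketch of how this can happen by accident, using a made-up feature that is mostly zeros: any scaling that shifts the data (for example by subtracting the mean) turns every zero into a non-zero value, while scaling that only divides leaves the zeros in place.

```python
import numpy as np

# Hypothetical sparse feature: mostly zeros.
x = np.array([0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 10.0, 0.0])

centered = x - x.mean()             # shifting: every zero becomes non-zero
scaled_only = x / np.abs(x).max()   # dividing only: zeros stay zeros

print(np.count_nonzero(x), np.count_nonzero(centered), np.count_nonzero(scaled_only))
```

For sparse inputs it is therefore usually safer to pick a scaler that rescales without centering.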

---


## Laplace smoothing


- ### Add `1` to every count (start counting from `1` rather than `0`)
- ### Keeps the model from completely ruling out unlikely events
- ### Works well when computing __*relative*__ values

## Example: tossing a biased coin

### $n_0$ - number of times "tails" came up
### $n_1$ - number of times "heads" came up
### $\hat{p}$ - estimate of the probability of "tails"

### Estimator: 
### $\hat{p} = \frac{n_0+1}{n_0 + n_1 + 2}$
### usually has a lower mean squared error, unless the true probability is very close to `0` or `1`, than
### Estimator: $\hat{p} = \frac{n_0}{n_0 + n_1}$
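
The comparison of the two estimators can be checked exactly by summing over every possible outcome of `n` tosses. The sketch below is an illustration with hypothetical helper names (`mse`, `plain`, `laplace`) and arbitrarily chosen values of `n` and `p`, where `p` is the true probability of tails.

```python
from math import comb


def mse(estimator, n, p):
    """Exact mean squared error of an estimator of p after n tosses."""
    total = 0.0
    for n0 in range(n + 1):
        # Binomial probability of seeing exactly n0 tails in n tosses.
        prob = comb(n, n0) * p**n0 * (1 - p) ** (n - n0)
        total += prob * (estimator(n0, n - n0) - p) ** 2
    return total


def plain(n0, n1):
    # Relative frequency of tails.
    return n0 / (n0 + n1)


def laplace(n0, n1):
    # Laplace-smoothed estimate: one extra "virtual" toss for each side.
    return (n0 + 1) / (n0 + n1 + 2)


n = 10
for p in (0.5, 0.3, 0.01):
    print(p, round(mse(plain, n, p), 5), round(mse(laplace, n, p), 5))
```

For a moderately biased coin the smoothed estimator has the smaller error, while for a coin that almost never shows tails the plain relative frequency wins. The smoothed estimate also never returns exactly `0` or `1`, which is what keeps rare events from being ruled out entirely.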


---
# Categorical variables <a id="cat"></a>

##  Indexing

In [None]:
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()

# Work on a copy so that `df` itself is not modified
label_encoded = df.copy()

label_encoded['Location_Cat'] = labelencoder.fit_transform(label_encoded['Location'])
label_encoded

---
## __*One-hot encoding*__ 

In [None]:
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')

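# Reuse the original index so that join() aligns rows correctly after rows were dropped
enc_df = pd.DataFrame(
    enc.fit_transform(label_encoded[['Location_Cat']]).toarray(),
    index=label_encoded.index)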

one_hot_data = label_encoded.join(enc_df)
one_hot_data

In [None]:
dum_df = pd.get_dummies(df, columns=['Location'])
dum_df

In [None]:
dum_df = pd.get_dummies(data, columns=['Location', 'Sold by', 'Type', 'Rooms no.', 'Bathroom no.', 'Parking'])
dum_df

In [None]:
dum_df.columns

---
## Order in categorical variables

![Clockface](img/clock.png)

### we convert the value to the coordinates of the tip of a clock "hand"

### $ m \to \left( \sin{\left(\frac{2\pi m}{12}\right)}, \cos{\left(\frac{2\pi m}{12}\right)} \right)$

In [None]:
df

In [None]:
import numpy as np

# A full circle is 2*pi, so month m maps to the angle 2*pi*m/12
df['month_x'] = df['month'].apply(lambda x: np.sin(2*np.pi*int(x)/12))
df['month_y'] = df['month'].apply(lambda x: np.cos(2*np.pi*int(x)/12))
df
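
The point of the clock-face encoding is that neighbouring months stay close even across the year boundary. A minimal sketch with hypothetical helper names (`month_to_xy`, `distance`) that measures the distance between encoded months:

```python
import math


def month_to_xy(month):
    # Angle of the clock hand for a month numbered 1..12.
    angle = 2 * math.pi * month / 12
    return math.sin(angle), math.cos(angle)


def distance(a, b):
    # Euclidean distance between two encoded months.
    (ax, ay), (bx, by) = month_to_xy(a), month_to_xy(b)
    return math.hypot(ax - bx, ay - by)


print(round(distance(12, 1), 3), round(distance(1, 2), 3), round(distance(12, 6), 3))
```

As plain integers, December (`12`) and January (`1`) are eleven units apart; on the circle they are exactly as close as January and February, and June is the farthest month from December.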

---
# Feature selection

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine

wine_data = load_wine()
wine_df = pd.DataFrame(
    data=wine_data.data, 
    columns=wine_data.feature_names)
wine_df['target'] = wine_data.target

In [None]:
wine_df

In [None]:
from sklearn.model_selection import train_test_split

X = wine_df.drop(['target'], axis=1)
y = wine_df['target']
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size=0.3, 
                                                    shuffle=True, 
                                                    stratify=y)


### - `shuffle` - the data is shuffled randomly before splitting
### - `stratify`  - each class keeps the same proportion in the training and test sets as in the full dataset
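
The effect of `stratify` is easiest to see on imbalanced labels. The sketch below uses made-up data (90 samples of class `0`, 10 of class `1`) and an arbitrary `random_state`; it compares the class counts in the test set with and without stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

# Same split size, once without and once with stratification.
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.3, random_state=0)
_, _, _, y_test_strat = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

print("plain:     ", np.bincount(y_test_plain, minlength=2))
print("stratified:", np.bincount(y_test_strat, minlength=2))
```

With stratification the test set keeps the 9:1 class ratio of the full data; without it the minority class may be over- or under-represented by chance.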


In [None]:
X_train.var(axis=0)

In [None]:
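from sklearn.preprocessing import Normalizer
# Note: Normalizer rescales each sample (row) to unit norm, not each column
norm = Normalizer().fit(X_train)
norm_X_train = norm.transform(X_train)
# Per-column variance after normalization
norm_X_train.var(axis=0)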

## "Manual" column dropping

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

dt = DecisionTreeClassifier(random_state=42)

dt.fit(X_train, y_train)

preds = dt.predict(X_test)
f1_score_all = round(f1_score(y_test, preds, average='weighted'),3)
f1_score_all



In [None]:
X_train_sel = X_train.drop(['hue', 'nonflavanoid_phenols'], axis=1)
X_test_sel = X_test.drop(['hue', 'nonflavanoid_phenols'], axis=1)
dt.fit(X_train_sel, y_train)
preds_sel = dt.predict(X_test_sel)
f1_score_sel = round(f1_score(y_test, preds_sel, average='weighted'), 3)
f1_score_sel

### Eliminating low-variance columns

In [None]:
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold = 1e-7)
selected_features = selector.fit_transform(norm_X_train)
selected_features.shape
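
The idea behind variance-based filtering fits in a few lines of NumPy. This is a sketch on made-up data rather than the library's implementation: compute the variance of every column and keep only the columns whose variance exceeds the threshold.

```python
import numpy as np

# Hypothetical data: the middle column is constant and carries no information.
X = np.array([[1.0, 5.0, 0.1],
              [2.0, 5.0, 0.4],
              [3.0, 5.0, 0.2],
              [4.0, 5.0, 0.3]])

threshold = 1e-7
variances = X.var(axis=0)      # one variance per column
keep = variances > threshold   # boolean mask of columns to keep
X_reduced = X[:, keep]

print(variances)
print(X_reduced.shape)
```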

## Dropping columns with the $\chi^2$ test

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

X_train_v2, X_test_v2, y_train_v2, y_test_v2 = X_train.copy(), X_test.copy(), y_train.copy(), y_test.copy()
f1_score_list = []
for k in range(1, 14):
    selector = SelectKBest(chi2, k=k)
    selector.fit(X_train_v2, y_train_v2)
    
    sel_X_train_v2 = selector.transform(X_train_v2)
    sel_X_test_v2 = selector.transform(X_test_v2)
    
    dt.fit(sel_X_train_v2, y_train_v2)
    kbest_preds = dt.predict(sel_X_test_v2)
    f1_score_kbest = round(f1_score(y_test_v2, kbest_preds, average='weighted'), 3)
    f1_score_list.append(f1_score_kbest)

print(f1_score_list)

In [None]:
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
plt.style.use("dark_background")

fig, ax = plt.subplots(figsize=(12, 6))
# Dedicated names, so that the target vector `y` is not overwritten
ks = list(range(1, 14))
scores = f1_score_list
ax.bar(ks, scores, width=0.4)
ax.set_xlabel('Number of features selected by the chi2 test')
ax.set_ylabel('F1-Score (weighted)')
ax.set_ylim(0, 1.2)
# Print each score just above its bar
for k, score in zip(ks, scores):
    ax.text(x=k, y=score + 0.05, s=str(score), ha='center')

fig.tight_layout()

## Recursive Feature Elimination

### - using a separate estimator that exposes `coef_` or `feature_importances_`, the least important features are discarded iteratively
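
The iterative elimination can be sketched as an explicit loop. This is a simplified illustration of the idea, not the library's implementation: it refits a decision tree on the remaining columns and removes the one with the lowest importance until four are left. When several importances tie, the feature it drops may differ from the one `RFE` would drop.

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
X, y = wine.data, wine.target
names = list(wine.feature_names)

# Indices of the columns still in play.
remaining = list(range(X.shape[1]))

while len(remaining) > 4:
    # Refit on the remaining columns only.
    model = DecisionTreeClassifier(random_state=42).fit(X[:, remaining], y)
    # Position (within `remaining`) of the least important feature.
    weakest = int(model.feature_importances_.argmin())
    del remaining[weakest]

print([names[i] for i in remaining])
```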

In [None]:
from sklearn.feature_selection import RFE

X_train_v3, X_test_v3, y_train_v3, y_test_v3 = X_train.copy(), X_test.copy(), y_train.copy(), y_test.copy()
RFE_selector = RFE(estimator=dt, n_features_to_select=4, step=1)
RFE_selector.fit(X_train_v3, y_train_v3)

In [None]:
X_train_v3.columns[RFE_selector.support_]

In [None]:
sel_X_train_v3 = RFE_selector.transform(X_train_v3)
sel_X_test_v3 = RFE_selector.transform(X_test_v3)
dt.fit(sel_X_train_v3, y_train_v3)
RFE_preds = dt.predict(sel_X_test_v3)
rfe_f1_score = round(f1_score(y_test_v3, RFE_preds, average='weighted'),3)
print(rfe_f1_score)

## Select from model

### - uses a separate estimator that exposes `coef_` or `feature_importances_` and keeps the features whose importance exceeds a threshold

In [None]:
from sklearn.feature_selection import SelectFromModel

X_train_v4, X_test_v4, y_train_v4, y_test_v4 = X_train.copy(), X_test.copy(), y_train.copy(), y_test.copy()

sfm_selector = SelectFromModel(estimator=DecisionTreeClassifier())
sfm_selector.fit(X_train_v4,  y_train_v4)

In [None]:
X_train_v4.columns[sfm_selector.get_support()]

In [None]:
sel_X_train_v4 = sfm_selector.transform(X_train_v4)
sel_X_test_v4 = sfm_selector.transform(X_test_v4)

dt.fit(sel_X_train_v4, y_train_v4)
sfm_preds = dt.predict(sel_X_test_v4)
sfm_f1_score = round(f1_score(y_test_v4, sfm_preds, average='weighted'),3)
print(sfm_f1_score)