# Machine Learning 4

### - Feature Engineering
### - The final shape of the `Pipeline`
### - Problems with model training
### - Materials for further study
### - Project assignment

---
## Feature Engineering

---
# The features you use influence more than everything else the result. 
# No algorithm alone, to my knowledge, can supplement the information gain given by correct feature engineering.
# <div style="text-align: right">— Luca Massaron, author, Kaggle master</div>

---

# Coming up with features is difficult, time-consuming, requires expert knowledge.
# "_*Applied machine learning*_" is basically feature engineering.
# <div style="text-align: right">— Andrew Ng</div>


---
## Feature Engineering Techniques
### Numbers
- Binarization
- Binning (fixed width or quantiles)
- Scaling
  - smoothing (adding 1: _*Laplace smoothing*_)
  - min-max
  - logarithm
  - standardization (centering and scaling to unit variance)
  - do NOT lose the zeros! (e.g. subtracting the mean turns sparse data into dense data)
  - advanced scaling: TF-IDF
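
A minimal sketch of several of the numeric techniques listed above, assuming pandas and NumPy are available. The `area` series and the 60 m2 threshold are made up for illustration and are not taken from the adverts data set.

```python
import numpy as np
import pandas as pd

# Hypothetical flat sizes in square metres (illustrative values only).
area = pd.Series([20.0, 35.0, 50.0, 80.0, 120.0, 200.0], name="area")

# Binarization: 1 if the flat is larger than 60 m2, otherwise 0.
is_large = (area > 60).astype(int)

# Binning with fixed-width intervals (3 bins of equal width).
fixed_bins = pd.cut(area, bins=3, labels=False)

# Binning by quantiles (3 bins holding roughly equal numbers of rows).
quantile_bins = pd.qcut(area, q=3, labels=False)

# Min-max scaling to the range [0, 1].
min_max = (area - area.min()) / (area.max() - area.min())

# Logarithm after adding 1, so that a value of 0 would stay 0: log2(0 + 1) == 0.
log_area = np.log2(area + 1)

# Standardization: subtract the mean and divide by the standard deviation.
# Note that subtracting the mean turns zeros into non-zeros (sparse -> dense).
standardized = (area - area.mean()) / area.std(ddof=0)

features = pd.DataFrame({
    "is_large": is_large,
    "fixed_bin": fixed_bins,
    "quantile_bin": quantile_bins,
    "min_max": min_max,
    "log2": log_area,
    "standardized": standardized,
})
print(features)
```

Fixed-width bins follow the value range, so skewed data can leave some bins nearly empty; quantile bins adapt to the distribution instead.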

### Aggregations
- sums, means, variances, higher moments (e.g. per category)
- switching from absolute to relative numbers (e.g. within a category)
- switching from values to ranks (ordering)
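
The aggregations above can be sketched with pandas `groupby`. The column names `district` and `price_m2` and all values are hypothetical, chosen only to make the example small.

```python
import pandas as pd

# Hypothetical adverts: district and price per square metre (illustrative values only).
df = pd.DataFrame({
    "district": ["A", "A", "A", "B", "B"],
    "price_m2": [10.0, 12.0, 14.0, 20.0, 30.0],
})

grouped = df.groupby("district")["price_m2"]

# Per-category aggregates, broadcast back onto every row of the category.
df["district_mean"] = grouped.transform("mean")
df["district_var"] = grouped.transform("var")

# Absolute -> relative: price as a fraction of the district mean.
df["price_rel"] = df["price_m2"] / df["district_mean"]

# Value -> rank within the district (1 = cheapest).
df["price_rank"] = grouped.rank(method="min").astype(int)

print(df)
```

Using `transform` rather than `agg` keeps one row per advert, so the new columns can be fed to a model directly.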


### Categories
- dummy encoding
- feature hashing
- dimensionality reduction
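
A small sketch contrasting dummy encoding with feature hashing. The district names are example values, and `hash_bucket` and `hash_encode` are hypothetical helpers written for this illustration; scikit-learn's `FeatureHasher` provides a library implementation of the same idea.

```python
import hashlib

import pandas as pd

# Hypothetical category values (illustrative only).
districts = pd.Series(["Mokotów", "Wola", "Mokotów", "Ursynów"], name="district")

# Dummy encoding: one column per distinct category.
dummies = pd.get_dummies(districts, prefix="district")


def hash_bucket(value, n_buckets=8):
    """Map a category to a bucket index using a stable hash (md5)."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def hash_encode(values, n_buckets=8):
    """Feature hashing: a fixed-width indicator vector per value."""
    rows = []
    for value in values:
        row = [0] * n_buckets
        row[hash_bucket(value, n_buckets)] = 1
        rows.append(row)
    return rows


hashed = hash_encode(districts)
print(dummies)
print(hashed)
```

Dummy encoding grows by one column for each new category, while hashing keeps the width fixed at `n_buckets` at the cost of possible collisions between categories.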

### Ordinal categories, e.g. dates
- splitting into components (day, month, year, quarter, day of week, day of month)
- conversion to polar variables

![Polar variables](Picture1.png)
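
A minimal sketch of the polar-variable idea, assuming it refers to the common sine/cosine encoding of a cyclic quantity. `cyclical_encode` is a hypothetical helper name, not something defined elsewhere in this material.

```python
import math


def cyclical_encode(value, period):
    """Place a cyclic value (e.g. month 1..12) on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


january = cyclical_encode(1, 12)
june = cyclical_encode(6, 12)
december = cyclical_encode(12, 12)

# As plain integers, December (12) and January (1) are 11 apart.
# On the circle they are neighbours, while June sits on the opposite side.
print(math.dist(december, january))
print(math.dist(december, june))
```

Each cyclic variable becomes two features (sine and cosine), because a single one of them cannot tell apart two points that lie symmetrically on the circle.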

---
## The final shape of the `Pipeline`

In [None]:
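import numpy as np
import pandas as pd

# Column names come from the Polish source data: 'Cena' = price,
# 'Wielkość (m2)' = size in m2, 'Data dodania' = date added, 'opis' = description.
data = pd.read_csv('adverts_29_04.csv', sep=';')

# Target variable: price per square metre.
data['cena_za_metr'] = data['Cena'] / data['Wielkość (m2)']
# Log-scaled flat size.
data['log'] = np.log2(data['Wielkość (m2)'])
# Month and year of the advert: skip the first three characters of the date string
# (assumes a "dd.mm.yyyy"-style format).
data['msc'] = data['Data dodania'].str[3:]

# Rows without a target value cannot be used for training.
data = data.dropna(subset=['cena_za_metr'])

# Drop the raw price (it would leak the target) and the raw date (replaced by 'msc').
df = data.drop(['Cena', 'Data dodania'], axis=1)

# Dummy-encode the categorical columns.
dum_df = pd.get_dummies(df, columns=['msc', 'Lokalizacja', 'Na sprzedaż przez', 'Rodzaj nieruchomości', 'Liczba pokoi', 'Liczba łazienek', 'Parking'])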


In [None]:
import gzip
import re

splitter = re.compile(r'[^ąąćęńłóóśśżżź\w]+')
isnumber = re.compile(r'[0-9]')

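# Load the inflection dictionary: each line holds comma-separated forms of one word,
# and every form is mapped to the first entry on its line (used as the lemma).
dictionary = {}
set_dict = set()

with gzip.open('odm.txt.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        forms = [form.strip().lower() for form in line.strip().split(',')]
        for w in forms:
            set_dict.add(w)
            dictionary[w] = forms[0]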

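import unicodedata

def lematize(w):
    # Assumption: the original character-by-character replace calls mapped decomposed
    # letters (base letter plus combining mark) to their precomposed forms.
    # NFC normalization does this for all characters at once.
    w = unicodedata.normalize('NFC', w)
    # Fall back to the word itself when it is not in the dictionary.
    return dictionary.get(w, w)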

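# First description, used below as a running example.
opis1 = dum_df['opis'].iloc[0]

# Tokenize every description, keeping the original letter case.
# Assumption: the original positional lookup `i[1][1]` referred to the 'opis' column.
raw_corpus = [splitter.split(str(opis)) for opis in dum_df['opis']]

# Flatten the corpus into a single list of words.
all_words = []
for tokens in raw_corpus:
    all_words.extend(tokens)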

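# For every word, count how often it appears in mixed case (e.g. capitalized)
# versus in a single case (all lowercase or all uppercase).
words = {}
for w in all_words:
    rec = words.get(w.lower(), {'upper': 0, 'lower': 0})
    if w.lower() == w or w.upper() == w:
        rec['lower'] += 1
    else:
        rec['upper'] += 1
    words[w.lower()] = rec

# Words that appear in mixed case at least 4 times as often as in a single case
# become corpus-derived stop words (presumably mostly proper nouns).
raw_stop_words = [w for w, rec in words.items() if rec['upper'] >= rec['lower'] * 4]

set_raw_stop_words = set(raw_stop_words)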

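def preprocessing(opis, filter_raw=True, filter_dict=True):
    """Tokenize, filter, and lemmatize one description; return a list of lemmas.

    filter_raw  -- drop the corpus-derived stop words (set_raw_stop_words)
    filter_dict -- keep only words found in the inflection dictionary (set_dict)
    """
    l = [x.lower() for x in splitter.split(str(opis))]
    l = [x for x in l if len(x) > 2]                   # drop very short tokens
    l = [x for x in l if '_' not in x]                 # drop tokens with underscores
    l = [x for x in l if isnumber.search(x) is None]   # drop tokens containing digits
    if filter_raw:
        l = [x for x in l if x not in set_raw_stop_words]
    if filter_dict:
        l = [x for x in l if x in set_dict]
    l = [lematize(x) for x in l]
    l = [x for x in l if len(x) > 2]                   # a lemma can be short again
    return l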

In [None]:
opis1

In [None]:
print(preprocessing(opis1))

In [None]:
print(preprocessing(opis1, filter_raw=False))

In [None]:
print(preprocessing(opis1, filter_dict=False))

In [None]:
print(preprocessing(opis1, filter_raw=False, filter_dict=False))

In [None]:
# Four variants of the cleaned description. The two-letter suffix encodes the flags
# (filter_raw, filter_dict): e.g. "TF" means filter_raw=True, filter_dict=False.
dum_df["opisTT"] = dum_df["opis"].apply(lambda x: ' '.join(preprocessing(x, filter_raw=True, filter_dict=True)))
dum_df["opisTF"] = dum_df["opis"].apply(lambda x: ' '.join(preprocessing(x, filter_raw=True, filter_dict=False)))
dum_df["opisFT"] = dum_df["opis"].apply(lambda x: ' '.join(preprocessing(x, filter_raw=False, filter_dict=True)))
dum_df["opisFF"] = dum_df["opis"].apply(lambda x: ' '.join(preprocessing(x, filter_raw=False, filter_dict=False)))

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from time import time
from sklearn.preprocessing import StandardScaler, Normalizer, RobustScaler
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import FeatureUnion

class ItemSelector(BaseEstimator, TransformerMixin):
    """Select a single column (by key) from the input DataFrame."""
    def __init__(self, key=''):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]

class ItemUnSelector(BaseEstimator, TransformerMixin):
    """Drop the listed columns and pass the remaining ones through."""
    def __init__(self, keys=[]):
        self.keys = keys

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict.drop(self.keys, axis=1)


pipeline = Pipeline([
   ('union', 
        FeatureUnion(
            transformer_list=[
                ('table', 
                    Pipeline([
                        ('selector1', ItemUnSelector(keys=['opis', 'opisTT', 'opisTF', 'opisFT', 'opisFF'])),
                        ('scaler1', 'passthrough')
                    ])
                ),
                ('description', 
                    Pipeline([
                        ('selector2', ItemSelector()),
                        ('tfidf', TfidfVectorizer()),
                        ('best', TruncatedSVD()),
                        ('scaler2', 'passthrough')
                    ])
                )
            ]
        )    

   ),
   ('regressor', 
        TransformedTargetRegressor()
    )
])

parameters = {
    'union__transformer_weights': [ { 'table': 3.0, 'description': 1.0}, { 'table': 2.0, 'description': 1.0}, { 'table': 1.0, 'description': 1.0}],

    'union__description__best__n_components': (650, 700, 750),
    'union__description__tfidf__min_df': (3, 4, 5),
    'union__description__tfidf__binary': (True,False),
    'union__description__selector2__key': ['opisTT', 'opisTF', 'opisFT', 'opisFF'] ,
    
    'union__table__scaler1': ['passthrough', StandardScaler(), Normalizer(), RobustScaler()],
    'union__description__scaler2': ['passthrough', StandardScaler(), Normalizer(), RobustScaler(with_centering=False)],
    
    'regressor': [SVR(kernel='rbf', C=10000), SVR(kernel='linear', C=10000), GradientBoostingRegressor()] ,
}

grid_search = GridSearchCV(pipeline, parameters, verbose=1, cv=10, n_jobs=-1)


y = dum_df['cena_za_metr']
X = dum_df.drop(['cena_za_metr'], axis=1)

t0 = time()
grid_search.fit(X, y)
print("done in %0.3fs" % (time() - t0))

# The full per-combination results are available in grid_search.cv_results_.
print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
```

---
## Problems with model training


### So how many fits is that?

- 3 sets of `union` weights
- 3 SVD dimension settings
- 6 sets of TF-IDF parameters (3 values of `min_df` × 2 values of `binary`)
- 4 variants of the text data
- 4 scaling mechanisms for the `table` part
- 4 scaling mechanisms for the `description` part
- 3 regressors
- 10 cross-validation folds

In [None]:
3 * 3 * 6 * 4 * 4 * 4 * 3 * 10
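
The same count can be derived from the sizes of the individual grid dimensions instead of being typed by hand. This is a small sketch; the dictionary keys are shortened, illustrative labels, and the sizes are copied from the list above.

```python
import math

# Number of candidate values per grid dimension (labels are illustrative).
grid_sizes = {
    "transformer_weights": 3,
    "svd_n_components": 3,
    "tfidf_min_df": 3,
    "tfidf_binary": 2,
    "text_variant": 4,
    "table_scaler": 4,
    "description_scaler": 4,
    "regressor": 3,
}
cv_folds = 10

# Every combination of values is tried, so the sizes multiply.
n_combinations = math.prod(grid_sizes.values())
n_fits = n_combinations * cv_folds

print(n_combinations, n_fits)  # 10368 103680
```

Adding a single extra value to any one dimension multiplies the total, which is why exhaustive grids grow so quickly.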

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from time import time
from sklearn.preprocessing import StandardScaler, Normalizer, RobustScaler
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import FeatureUnion

class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key=''):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]

class ItemUnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, keys=[]):
        self.keys = keys

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict.drop(self.keys, axis=1)


pipeline = Pipeline([
   ('union', 
        FeatureUnion(
            transformer_list=[
                ('table', 
                    Pipeline([
                        ('selector1', ItemUnSelector(keys=['opis', 'opisTT', 'opisTF', 'opisFT', 'opisFF'])),
                        ('scaler1', 'passthrough')
                    ])
                ),
                ('description', 
                    Pipeline([
                        ('selector2', ItemSelector()),
                        ('tfidf', TfidfVectorizer()),
                        ('best', TruncatedSVD()),
                        ('scaler2', 'passthrough')
                    ])
                )
            ]
        )    

   ),
   ('regressor', 
        TransformedTargetRegressor()
    )
])

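# Single-point grid: only one parameter combination, so this run shows
# how long one cross-validated combination takes.
parameters = {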
    'union__transformer_weights': [  { 'table': 1.0, 'description': 1.0}],

    'union__description__best__n_components': (700,),
    'union__description__tfidf__min_df': (3,),
    'union__description__tfidf__binary': (True,),
    'union__description__selector2__key': [ 'opisFF'] ,
    
    'union__table__scaler1': [ RobustScaler()],
    'union__description__scaler2': [ RobustScaler(with_centering=False)],
    
    'regressor': [ GradientBoostingRegressor()] ,
}

grid_search = GridSearchCV(pipeline, parameters, verbose=1, cv=10, n_jobs=-1)


y = dum_df['cena_za_metr']
X = dum_df.drop(['cena_za_metr'], axis=1)

t0 = time()
grid_search.fit(X, y)
print("done in %0.3fs" % (time() - t0))

print(f'Best score: {grid_search.best_score_}')

print("Best parameters set:")
print()
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

In [None]:
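# Estimated total time in seconds, assuming one parameter combination
# (all 10 folds) takes about 30 s.
secs = 3 * 3 * 6 * 4 * 4 * 4 * 3 * 30 

In [None]:
# The same estimate expressed in days (about 3.6).
secs/(3600*24)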
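
The estimate can also be written out in a readable form. The sketch below assumes, as in the cell above, roughly 30 seconds per parameter combination; that figure is an assumption about the machine, not a measured constant. It also shows how much the total shrinks when both scaler dimensions are fixed to a single choice.

```python
from datetime import timedelta

seconds_per_combination = 30  # assumed cost of one combination (all folds)

# Full grid: weights * SVD sizes * TF-IDF settings * text variants
#            * table scalers * description scalers * regressors.
full_grid = 3 * 3 * 6 * 4 * 4 * 4 * 3
full_time = timedelta(seconds=full_grid * seconds_per_combination)

# Reduced grid: both scalers fixed to one choice each (4 * 4 -> 1 * 1).
reduced_grid = 3 * 3 * 6 * 4 * 1 * 1 * 3
reduced_time = timedelta(seconds=reduced_grid * seconds_per_combination)

print(full_time)     # 3 days, 14:24:00
print(reduced_time)  # 5:24:00
```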

---
## Improving performance
### More information
- horizontally: more data (from where? won't it distort the model?)
- vertically: more dimensions
  - more data (photos?)
  - more dimensions: Feature Engineering

---
## Materials for further study
- Udemy - https://www.udemy.com/course/introduction-to-data-science-using-python/
- Udemy - https://www.udemy.com/course/python-scrapy-for-beginners/
- edX - https://www.edx.org/course/introduction-to-python-for-data-science-2
- Coursera - IBM https://www.coursera.org/learn/python-for-applied-data-science-ai
- Coursera - Stanford Machine Learning https://www.coursera.org/learn/machine-learning

### There is a lot of it ...
- https://www.forbes.com/sites/bernardmarr/2020/02/24/the-9-best-free-online-data-science-courses-in-2020/
- https://www.dataquest.io/blog/free-books-learn-data-science/
- 100+ - https://www.learndatasci.com/free-data-science-books/
- https://jakevdp.github.io/PythonDataScienceHandbook/

---
# Project topic

- Download data (`Scrapy`, `requests` ...): about 1000 records (the more, the better)
- Prepare the data for analysis (`Beautiful Soup`, `lxml`)
- Build a `Pipeline`
- Train the best model you can

---