# Python in Data Science Workshop

---
## Feature Engineering - part 2 of 3  

- ### The curse of dimensionality
- ### Preparing vector models skillfully
- ### Feature selection
- ### Dimensionality reduction
  - #### SVD
  - #### LDA


---
# The Curse of Dimensionality

![Dimension levels](img/Dimension_levels.svg.png)  

https://en.wikipedia.org/wiki/Hypercube

- ## Diagonal of a 1-dimensional cube with side 1:  $ d = \sqrt{1} $
- ## Diagonal of a 2-dimensional cube with side 1:  $ d = \sqrt{2} $
- ## Diagonal of a 3-dimensional cube with side 1:  $ d = \sqrt{3} $
- ## Diagonal of a 4-dimensional cube with side 1:  $ d = \sqrt{4} $
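
The pattern in the list above generalizes: the main diagonal of an n-dimensional cube with side 1 has length $ \sqrt{n} $, so opposite corners drift apart as dimensions are added even though every side keeps length 1. A minimal sketch (the helper name `unit_cube_diagonal` is made up for this illustration):

```python
import math

def unit_cube_diagonal(n):
    """Length of the main diagonal of an n-dimensional cube with side 1."""
    # The diagonal joins (0, ..., 0) and (1, ..., 1): sqrt(1^2 + ... + 1^2) = sqrt(n)
    return math.sqrt(n)

for n in (1, 2, 3, 4, 100, 10_000):
    print(n, unit_cube_diagonal(n))
```

With 10,000 dimensions, a vocabulary size that is easy to reach with text data, the diagonal is already 100 times the side length.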


---
## Preparing vector models skillfully

In [1]:
import pandas as pd

data = pd.read_csv('data/gumtree-2022-03-20.tsv', sep='\t', index_col=0)

In [2]:
data

Unnamed: 0,url,title,description,price,District,m2,Data dodania,Na sprzedaż przez,Rodzaj nieruchomości,Liczba pokoi,Liczba łazienek
0,https://www.gumtree.pl/a-mieszkania-i-domy-spr...,Mieszkanie Warszawa Śródmieście 93.9m2 (nr: M-...,<br/><br/><i>Numer oferty w biurze: M-101293-1...,1500000.0,Śródmieście,94.0,06/03/2022,Agencja,Mieszkanie,3 pokoje,1 łazienka
1,https://www.gumtree.pl/a-mieszkania-i-domy-spr...,Dom Warszawa Wawer 185m2 (nr: D-100575-16),<br/><br/><i>Numer oferty w biurze: D-100575-1...,1580000.0,Wawer,185.0,06/03/2022,Agencja,Dom,6 lub więcej pokoi,2 łazienki
2,https://www.gumtree.pl/a-mieszkania-i-domy-spr...,Mieszkanie Warszawa Ochota 39.2m2 (nr: M-10171...,<br/><br/><i>Numer oferty w biurze: M-101712-1...,500000.0,Ochota,40.0,06/03/2022,Agencja,Mieszkanie,2 pokoje,1 łazienka
3,https://www.gumtree.pl/a-mieszkania-i-domy-spr...,Mieszkanie Warszawa Wilanów 51.4m2 (nr: M-1008...,<br/><br/><i>Numer oferty w biurze: M-100859-1...,858628.0,Wilanów,52.0,06/03/2022,Agencja,Mieszkanie,2 pokoje,1 łazienka
4,https://www.gumtree.pl/a-mieszkania-i-domy-spr...,Mieszkanie Warszawa Białołęka 92m2 (nr: M-9786...,<br/><br/><i>Numer oferty w biurze: M-97860-16...,899000.0,Białołęka,92.0,06/03/2022,Agencja,Mieszkanie,4 pokoje,1 łazienka
...,...,...,...,...,...,...,...,...,...,...,...
970,https://www.gumtree.pl/a-mieszkania-i-domy-spr...,》》》Mokotów / świetna lokalizacja / 3p / 2 balk...,<b>Na sprzedaż trzypokojowe mieszkanie w świet...,1035000.0,Mokotów,69.0,05/03/2022,Agencja,Mieszkanie,3 pokoje,1 łazienka
971,https://www.gumtree.pl/a-mieszkania-i-domy-spr...,》》》Wilanów / 3p / balkon / 2 łazienki / rozkł...,"<b>Świetne, rozkładowe mieszkanie trzypokojowe...",650000.0,Wilanów,64.0,05/03/2022,Agencja,Mieszkanie,3 pokoje,2 łazienki
972,https://www.gumtree.pl/a-mieszkania-i-domy-spr...,》》》Ursynów / 3p / świetna lokalizacja / widna ...,<b>Ustawne i rozkładowe mieszkanie trzypokojow...,899000.0,Ursynów,63.0,05/03/2022,Agencja,Mieszkanie,3 pokoje,2 łazienki
973,https://www.gumtree.pl/a-mieszkania-i-domy-spr...,》》》Mokotów / 2p / świetna lokalizacja / możliw...,"<b>Rozkładowe i przestronne, dwupokojowe miesz...",770000.0,Mokotów,70.0,05/03/2022,Agencja,Mieszkanie,2 pokoje,2 łazienki


In [3]:
opis = data['description'][6]
opis

'<p>Sprzedam mieszkanie o powierzchni\r\n45,05 m2 składające się z trzech pokoi, kuchni, łazienki, przedpokoju oraz\r\nbalkonu. Do mieszkania przynależy obszerna piwnica.</p><p>Mieszkanie jest bardzo\r\njasne i rozkładowe, po remoncie, gotowe do zamieszkania.<br/>\n<br/>\r\nMieści się na 1 piętrze w 8-piętrowym budynku (z windą).<br/>\n<br/>\r\nCzynsz ok 385 zł<br/>\n<br/>\r\nStatus prawny: pełna własność z księgą wieczystą<br/>\n<br/>\r\nDużym atutem mieszkania jest doskonała lokalizacja – ul.\r\nKasprowicza 20 na Starych Bielanach. Bardzo dobra komunikacja - 300 m do\r\nstacji metra Słodowiec, autobusy i tramwaje.<br/>\n<br/>\r\nW okolicy\r\nliczne sklepy, punkty usługowe i gastronomiczne, zielona i cicha okolica, w\r\npobliżu  Lasek Bielański, Stawy Kellera, AWF.</p><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><p>\r\n\r\n\r\n\r\nAgencje proszę nie dzwonić, ewentualne spotkania\r\ntylko z konkretnym klientem.<br/>\n<br/>\n</p><br/><br/>'

In [4]:
import re

def no_tags(s):
    return re.sub(r'<[^<]+?>','',s)
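
One side effect of replacing tags with the empty string is that words separated only by a tag such as `<br/>` get glued together. This may be where vocabulary entries like `balkondo` further down come from. The sketch below contrasts the function above with a hypothetical variant, `no_tags_spaced`, which is not part of the original notebook and substitutes a space instead:

```python
import re

def no_tags(s):
    # Same pattern as in the notebook: every tag is replaced with the empty string
    return re.sub(r'<[^<]+?>', '', s)

def no_tags_spaced(s):
    # Hypothetical variant: replace each tag with a space so neighbouring words stay apart
    return re.sub(r'<[^<]+?>', ' ', s)

sample = 'balkon<br/>do mieszkania'
print(no_tags(sample))         # the words around <br/> are glued together
print(no_tags_spaced(sample))  # the words stay separate
```

Switching to the spaced variant would change the vocabulary and therefore every output shown later in this notebook.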


In [5]:
import re

# In Python 3, \w already matches Polish letters, so no explicit character list is needed
tokenizer = re.compile(r'\W+')

In [6]:
def preprocessing(opis):
    opis = no_tags(opis)
    tokenized = tokenizer.split(opis)
    l = list(tokenized)
    l = [ x.lower() for x in l ]
    return l
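
`re.split` returns an empty string whenever the text starts or ends with a separator, which is likely why `''` shows up among the most frequent tokens later on. A minimal sketch of a variant that drops such tokens follows. `preprocessing_clean` is a hypothetical name, and tag stripping is left out for brevity:

```python
import re

tokenizer = re.compile(r'\W+')

def preprocessing_clean(text):
    # Hypothetical variant of preprocessing(): lower-case tokens and drop empty ones
    return [tok.lower() for tok in tokenizer.split(text) if tok]

print(tokenizer.split('Jasne, rozkładowe.'))      # split leaves '' after the final '.'
print(preprocessing_clean('Jasne, rozkładowe.'))  # empty token removed
```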

In [7]:
corpus = []

# Missing descriptions are read as NaN (a float), so only string values are processed
for opis in data['description']:
    if isinstance(opis, str):
        corpus.append(preprocessing(opis))

for opis in corpus[:7]:
    print(opis)

['numer', 'oferty', 'w', 'biurze', 'm', '101293', '11']
['numer', 'oferty', 'w', 'biurze', 'd', '100575', '16']
['numer', 'oferty', 'w', 'biurze', 'm', '101712', '11']
['numer', 'oferty', 'w', 'biurze', 'm', '100859', '16']
['numer', 'oferty', 'w', 'biurze', 'm', '97860', '16']
['numer', 'oferty', 'w', 'biurze', '2009']
['sprzedam', 'mieszkanie', 'o', 'powierzchni', '45', '05', 'm2', 'składające', 'się', 'z', 'trzech', 'pokoi', 'kuchni', 'łazienki', 'przedpokoju', 'oraz', 'balkonu', 'do', 'mieszkania', 'przynależy', 'obszerna', 'piwnica', 'mieszkanie', 'jest', 'bardzo', 'jasne', 'i', 'rozkładowe', 'po', 'remoncie', 'gotowe', 'do', 'zamieszkania', 'mieści', 'się', 'na', '1', 'piętrze', 'w', '8', 'piętrowym', 'budynku', 'z', 'windą', 'czynsz', 'ok', '385', 'zł', 'status', 'prawny', 'pełna', 'własność', 'z', 'księgą', 'wieczystą', 'dużym', 'atutem', 'mieszkania', 'jest', 'doskonała', 'lokalizacja', 'ul', 'kasprowicza', '20', 'na', 'starych', 'bielanach', 'bardzo', 'dobra', 'komunikacja', 

In [8]:
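import gzip

# Map every inflected form to its lemma: the first entry on a line is the lemma,
# the remaining entries are its forms
dictionary = {}

with gzip.open('data/odm.txt.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        forms = [form.strip().lower() for form in line.strip().split(',')]
        for w in forms[1:]:
            dictionary[w] = forms[0]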


In [9]:
def lematize(w):
    return dictionary.get(w,w)
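
The lookup above is a plain dictionary access with the word itself as the fallback. A self-contained sketch with two made-up lines standing in for `odm.txt` (the sample lines are an assumption about the file layout, inferred from how the loading code splits each line):

```python
# Toy stand-in for odm.txt: lemma first, then its inflected forms (made-up sample lines)
lines = [
    'mieszkanie, mieszkania, mieszkaniu',
    'pokój, pokoi, pokoje',
]

dictionary = {}
for line in lines:
    forms = [form.strip().lower() for form in line.split(',')]
    for w in forms[1:]:
        dictionary[w] = forms[0]

def lematize(w):
    # Words missing from the dictionary are returned unchanged
    return dictionary.get(w, w)

print([lematize(w) for w in ['pokoi', 'mieszkania', 'słodowiec']])
```

Because the lookup ignores context, a form shared by several lemmas resolves to whichever line was read last. This may explain results such as `ok` becoming `oko` in the lemmatized output.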

In [10]:
corpusl = [[ lematize(x) for x in l ] for l in corpus]
for opis in corpusl[:7]:
    print(opis)

['numer', 'oferta', 'w', 'biuro', 'm', '101293', '11']
['numer', 'oferta', 'w', 'biuro', 'd', '100575', '16']
['numer', 'oferta', 'w', 'biuro', 'm', '101712', '11']
['numer', 'oferta', 'w', 'biuro', 'm', '100859', '16']
['numer', 'oferta', 'w', 'biuro', 'm', '97860', '16']
['numer', 'oferta', 'w', 'biuro', '2009']
['sprzedać', 'mieszkać', 'o', 'powierzchnia', '45', '05', 'm2', 'składać', 'siebie', 'z', 'trzy', 'pokój', 'kuchnia', 'łazienka', 'przedpokój', 'oraz', 'balkon', 'do', 'mieszkanie', 'przynależeć', 'obszerny', 'piwnica', 'mieszkać', 'być', 'bardzo', 'jasny', 'i', 'rozkładowy', 'po', 'remont', 'gotowy', 'do', 'zamieszkać', 'mieścić', 'siebie', 'na', '1', 'piętro', 'w', '8', 'piętrowy', 'budynek', 'z', 'winda', 'czynsz', 'oko', '385', 'zł', 'status', 'prawny', 'pełny', 'własność', 'z', 'księga', 'wieczysty', 'duży', 'atut', 'mieszkanie', 'być', 'doskonały', 'lokalizacja', 'ula', 'kasprowicz', '20', 'na', 'starzy', 'bielan', 'bardzo', 'dobry', 'komunikacja', '300', 'm', 'do', 'st

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [ ' '.join(words) for words in corpusl]

tfidf = TfidfVectorizer()
tfs = tfidf.fit_transform(corpus)
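
To see what `fit_transform` returns without loading the full dataset, here is a minimal sketch on three made-up documents (the toy corpus and the `_toy` names are illustrative only):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['mieszkanie z balkon', 'dom z ogród', 'mieszkanie blisko metro']

tfidf_toy = TfidfVectorizer()
tfs_toy = tfidf_toy.fit_transform(docs)  # sparse matrix: one row per document

print(tfs_toy.shape)                     # (documents, distinct terms)
print(sorted(tfidf_toy.vocabulary_))     # 'z' is absent: the default token pattern needs 2+ characters
print(np.linalg.norm(tfs_toy.toarray(), axis=1))  # rows are L2-normalised by default
```

Every distinct term becomes its own column, so the number of dimensions grows with the vocabulary rather than with the number of documents.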

In [13]:
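# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = tfidf.get_feature_names_out().tolist()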
print(feature_names[:150])

['00', '000', '000zł', '002', '009', '00andrzej', '01', '014', '02', '025', '03', '035', '05', '06', '062', '06m2', '07m2', '07m23', '08', '085elżbieta', '087', '094', '0m', '10', '100', '1000', '10000', '100016', '100057', '100069', '100088', '100100', '100196', '100197', '100199', '100323', '100348', '100351', '100363', '100380', '100448', '100492', '100512', '100574', '100575', '100587', '100618', '100625', '100647', '100649', '100654', '100668', '10066biuro', '100695', '10069biuro', '100741', '100788', '100795', '100817', '100835', '100859', '100895', '100938', '100m', '100m2', '101', '101029', '101039', '101086', '101088', '101103', '101130', '101191', '101203', '101242', '101244', '101264', '101265', '101293', '101302', '101328', '101354', '101373', '101375', '101378', '101393', '101394', '101395', '101441', '101488', '101525', '101540', '101543', '101565', '101624', '101643', '101647', '101653', '101654', '101663', '101690', '101712', '101719', '101720', '101734', '102', '105', 

In [14]:
corpus_index = range(len(corpus))
df = pd.DataFrame(tfs.toarray(), index=corpus_index, columns=feature_names)
df

Unnamed: 0,00,000,000zł,002,009,00andrzej,01,014,02,025,...,żolibórz,życzyć,żygońdamian,żygońinspace,żywica,żywopłot,żywy,żyć,żłobek,żłóbek
0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
930,0.0,0.104155,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
931,0.0,0.114765,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
932,0.0,0.113898,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
933,0.0,0.000000,0.240107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [15]:
corpus = [ [ word for word in words if not word.isnumeric() ] for words in corpusl ]
nn_corpus = [ ' '.join(words) for words in corpus ]

tfidf = TfidfVectorizer()
tfs = tfidf.fit_transform(nn_corpus)

In [16]:
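# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = tfidf.get_feature_names_out().tolist()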
print(feature_names[:150])

['000zł', '00andrzej', '06m2', '07m2', '07m23', '085elżbieta', '0m', '10066biuro', '10069biuro', '100m', '100m2', '10m2', '10min', '10tym', '110m2', '118bujak', '11m', '11m2', '11piwnica', '1273m2', '12cm', '12ej', '12m', '12m2', '130m2', '130m2komunikacja', '13m2', '13m2okolice', '13m2w', '140x200', '14m2', '14tys', '150m', '15m2', '15tys', '160m2', '160polecam', '160polecam504', '16ej', '16m2', '16m2lokal', '170zł', '17m2', '180cm', '182w', '18m', '18m2', '18m26', '18m5', '190cm', '1911roku', '1937roku', '1963winda', '1994r', '19m2', '1klasyczny', '1m', '1m2', '1mieszkanie', '1nietaknieco', '1okna', '1os', '1osoba', '1taras', '2008r', '200m', '2011najbliższe', '2015r', '2018r', '2021r', '2022r', '20m2', '20min', '21m', '24h', '254678s', '254679s', '254680s', '25m2', '267cm', '270cm', '273facebook', '27m2', '28m2', '29a', '2cena', '2m', '2m2', '2mdodatkowo', '2os', '300m', '300mb', '308paweł', '30m2', '30tys', '31mkw', '33m2', '348malinowski', '34m2', '34na', '350m', '35zapraszam', '3

In [17]:
df = pd.DataFrame(tfs.toarray(), index=corpus_index, columns=feature_names)
df

Unnamed: 0,000zł,00andrzej,06m2,07m2,07m23,085elżbieta,0m,10066biuro,10069biuro,100m,...,żolibórz,życzyć,żygońdamian,żygońinspace,żywica,żywopłot,żywy,żyć,żłobek,żłóbek
0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
930,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
931,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
932,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
933,0.246513,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [18]:
tfidf = TfidfVectorizer(min_df=2)
tfs = tfidf.fit_transform(nn_corpus)
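# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = tfidf.get_feature_names_out().tolist()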
print(feature_names[:150])

['085elżbieta', '100m', '10m2', '11m2', '12ej', '12m2', '14m2', '15m2', '16ej', '16m2', '18m', '18m26', '18m5', '19m2', '1mieszkanie', '2008r', '200m', '20min', '24h', '25m2', '273facebook', '28m2', '2m', '300m', '38m2', '3m2', '42m2', '45m2', '46m2', '48m22', '4m', '4m2', '56m2', '56mmieszkanie', '5m', '5m2', '608648510nestor', '6m2', '700zł', '75m2', '799agencjom', '799w', '7m', '7m2', '869zapraszam', '8m2', '93opisy', '9m', 'aby', 'aco', 'administracyjny', 'administrować', 'adres', 'aga', 'agd', 'agencja', 'agent', 'aktualny', 'aktywny', 'al', 'ala', 'aleja', 'aluzyjny', 'amica', 'aneks', 'antresola', 'antysmogowystrony', 'antywłamaniowy', 'apartament', 'apartamentowiec', 'apartamentowy', 'apartamentyostródzkaw', 'app', 'apteka', 'aranżacja', 'aranżacyjny', 'architekt', 'architektoniczny', 'architektura', 'are', 'arkadia', 'armatura', 'art', 'arteria', 'atrakcja', 'atrakcyjny', 'atrium', 'atut', 'auchan', 'aut', 'auto', 'autobus', 'autobusowy', 'aż', 'balkon', 'balkondo', 'balkonowy

In [19]:
df = pd.DataFrame(tfs.toarray(), index=corpus_index, columns=feature_names)
df

Unnamed: 0,085elżbieta,100m,10m2,11m2,12ej,12m2,14m2,15m2,16ej,16m2,...,świeżość,świeży,żabka,żaden,że,żolibórz,życzyć,żywy,żyć,żłobek
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
930,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
931,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
932,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
933,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [20]:
all_words = []
for t in corpus:
    all_words += t
 
print(all_words[:15])

['numer', 'oferta', 'w', 'biuro', 'm', 'numer', 'oferta', 'w', 'biuro', 'd', 'numer', 'oferta', 'w', 'biuro', 'm']


In [21]:
counter = {}

for w in all_words:
    counter[w] = counter.get(w,0)+1


In [22]:
from operator import itemgetter
counted_words= [ (word,cnt) for word,cnt in counter.items() ]
counted_words.sort(key=itemgetter(1), reverse=True)
counted_words[:20]

[('w', 2066),
 ('z', 1110),
 ('na', 972),
 ('i', 916),
 ('oferta', 843),
 ('do', 803),
 ('biuro', 720),
 ('numer', 701),
 ('mieszkać', 624),
 ('być', 557),
 ('siebie', 440),
 ('mieszkanie', 377),
 ('m2', 372),
 ('on', 302),
 ('budynek', 287),
 ('oraz', 286),
 ('oda', 239),
 ('o', 225),
 ('łazienka', 224),
 ('kuchnia', 221)]
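
The manual counting and sorting in the cells above can also be written with `collections.Counter` from the standard library. A minimal sketch on a made-up token list (`all_words_toy` is illustrative, not the notebook's data):

```python
from collections import Counter

all_words_toy = ['numer', 'oferta', 'w', 'biuro', 'w', 'oferta', 'w']

counter_toy = Counter(all_words_toy)

# most_common(n) returns the n most frequent (word, count) pairs, highest count first
print(counter_toy.most_common(2))
```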

In [23]:
stop_words = [ word[0] for word in counted_words[:150]]
print(stop_words)

['w', 'z', 'na', 'i', 'oferta', 'do', 'biuro', 'numer', 'mieszkać', 'być', 'siebie', 'mieszkanie', 'm2', 'on', 'budynek', 'oraz', 'oda', 'o', 'łazienka', 'kuchnia', 'duży', 'okno', 'm', 'znajdywać', 'ten', '', 'miejsce', 'pokój', 'salon', 'powierzchnia', 'ula', 'zł', 'stan', 'bardzo', 'metr', 'piętro', 'lokal', 'przy', 'rok', 'okolica', 'nieruchomość', 'balkon', 'oko', 'sprzedaż', 'dla', 'lokalizacja', 'dodatkowy', 'warszawa', 'cena', 'dobry', 'osiedle', 'przedpokój', 'nowy', 'po', 'zapraszać', 'dwa', 'sypialnia', 'cichy', 'oms', 'blok', 'minuta', 'przystanek', 'pod', 'możliwość', 'centrum', 'remont', 'bezpośredni', 'zostać', 'składać', 'piwnica', 'park', 'mina', 'księga', 'pełny', 'świetny', 'zielony', 'wc', 'autobusowy', 'przynależeć', 'winda', 'położyć', 'przestronny', 'wieczysty', 'pobliże', 'tylka', 'oddzielny', 'lubić', 'stacja', 'jaka', 'wysokość', 'czynsz', 'a', 'przedszkole', 'szkoła', 'widny', 'klatka', 'za', 'własnościowy', 'dom', 'liczny', 'kuchenny', 'idealny', 'możny', 's

In [24]:
tfidf = TfidfVectorizer(stop_words=stop_words , min_df=2)
tfs = tfidf.fit_transform(nn_corpus)
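# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = tfidf.get_feature_names_out().tolist()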
print(feature_names[:150])

['085elżbieta', '100m', '10m2', '11m2', '12ej', '12m2', '14m2', '15m2', '16ej', '16m2', '18m', '18m26', '18m5', '19m2', '1mieszkanie', '2008r', '200m', '20min', '24h', '25m2', '273facebook', '28m2', '2m', '300m', '38m2', '3m2', '42m2', '45m2', '46m2', '48m22', '4m', '4m2', '56m2', '56mmieszkanie', '5m', '5m2', '608648510nestor', '6m2', '700zł', '75m2', '799agencjom', '799w', '7m', '7m2', '869zapraszam', '8m2', '93opisy', '9m', 'aby', 'aco', 'administracyjny', 'administrować', 'adres', 'aga', 'agd', 'agencja', 'agent', 'aktualny', 'aktywny', 'al', 'ala', 'aleja', 'aluzyjny', 'amica', 'antresola', 'antysmogowystrony', 'antywłamaniowy', 'apartament', 'apartamentowiec', 'apartamentowy', 'apartamentyostródzkaw', 'app', 'apteka', 'aranżacja', 'aranżacyjny', 'architekt', 'architektoniczny', 'architektura', 'are', 'arkadia', 'armatura', 'art', 'arteria', 'atrakcja', 'atrakcyjny', 'atrium', 'atut', 'auchan', 'aut', 'auto', 'autobus', 'aż', 'balkondo', 'balkonowy', 'balkontoaleta', 'balkony6', '

In [25]:
df = pd.DataFrame(tfs.toarray(), index=corpus_index, columns=feature_names)
df

Unnamed: 0,085elżbieta,100m,10m2,11m2,12ej,12m2,14m2,15m2,16ej,16m2,...,świeżość,świeży,żabka,żaden,że,żolibórz,życzyć,żywy,żyć,żłobek
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
930,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
931,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
932,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
933,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0


---
## Feature Selection

In [27]:
from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold=0.01)
sel.fit_transform(df).shape

(935, 2)

In [28]:
from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold=0.001)
sel.fit_transform(df).shape

(935, 14)

In [29]:
from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold=0.0001)
sel.fit_transform(df).shape

(935, 679)
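
`VarianceThreshold` keeps only the columns whose variance exceeds the threshold. TF-IDF values lie between 0 and 1 and are mostly zero, so their variances are small, which is why the number of surviving columns changes so sharply between the thresholds above. A minimal sketch on a made-up matrix:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Made-up data: a constant column, a high-variance column and a nearly constant column
X_toy = np.array([
    [0.0, 1.0, 0.10],
    [0.0, 0.0, 0.11],
    [0.0, 1.0, 0.10],
    [0.0, 0.0, 0.11],
])

sel_toy = VarianceThreshold(threshold=0.01)
X_reduced = sel_toy.fit_transform(X_toy)

print(sel_toy.variances_)     # per-column variance
print(sel_toy.get_support())  # boolean mask of the columns that were kept
print(X_reduced.shape)
```

Only the middle column (variance 0.25) passes the 0.01 threshold here.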

```python

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# X and y are placeholders: X is a non-negative feature matrix, y holds one class label per row
sel = SelectKBest(chi2, k=200)
sel.fit_transform(X, y).shape
```
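
The snippet above leaves `X` and `y` undefined. A self-contained sketch of chi-squared selection follows, using a made-up corpus and made-up labels. In the notebook's setting a categorical column of the listing data could play the role of `y`, but that choice is an assumption, not something the original states:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = [
    'mieszkanie balkon winda metro',
    'mieszkanie balkon piwnica',
    'dom ogród garaż',
    'dom ogród taras',
]
labels = [0, 0, 1, 1]  # made-up target: 0 = flat, 1 = house

X_toy = TfidfVectorizer().fit_transform(docs)  # sparse and non-negative, as chi2 requires
selector = SelectKBest(chi2, k=4)
X_best = selector.fit_transform(X_toy, labels)

print(X_toy.shape, '->', X_best.shape)
```

Unlike `VarianceThreshold`, this selection is supervised: it ranks columns by how strongly they depend on the labels.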

## NOTE

Feature selection with sparse data

If you use sparse data (i.e. data represented as sparse matrices), chi2, mutual_info_regression, mutual_info_classif will deal with the data without making it dense.

https://scikit-learn.org/stable/modules/feature_selection.html

---
## Dimensionality Reduction - `sklearn.decomposition`: Matrix Decomposition

### SVD - Singular Value Decomposition

In [30]:
from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD

pipeline = Pipeline([
                ('tfidf', TfidfVectorizer(stop_words=stop_words , min_df=2)),
                ('best', TruncatedSVD(n_components=150)),
])

decomposed = pipeline.fit_transform(nn_corpus)

df = pd.DataFrame(decomposed)
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,140,141,142,143,144,145,146,147,148,149
0,0.000000e+00,0.000000e+00,0.000000,0.000000,0.000000e+00,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
1,0.000000e+00,0.000000e+00,0.000000,0.000000,0.000000e+00,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2,0.000000e+00,0.000000e+00,0.000000,0.000000,0.000000e+00,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
3,0.000000e+00,0.000000e+00,0.000000,0.000000,0.000000e+00,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
4,0.000000e+00,0.000000e+00,0.000000,0.000000,0.000000e+00,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
930,-9.639294e-12,-2.145611e-10,0.367554,0.463623,2.433461e-08,0.051088,0.027558,-0.045120,0.035379,-0.022390,...,-0.007090,-0.019456,-0.007504,-0.068534,-0.068930,0.128401,-0.047899,-0.064381,-0.009168,-0.096663
931,5.613907e-12,-4.086090e-11,0.409622,0.533936,-9.822934e-09,0.064999,-0.033541,-0.051176,0.037089,-0.044297,...,-0.010599,-0.068685,0.057293,-0.023343,-0.016341,-0.050487,-0.003608,-0.013742,0.028543,0.044843
932,5.894088e-12,2.463094e-10,0.447617,0.588433,-1.169674e-08,0.069028,0.016553,0.039975,-0.011109,0.034740,...,0.009058,0.063044,0.009061,-0.002702,0.009649,0.009165,-0.004361,-0.057636,-0.020774,0.055984
933,-6.100923e-12,-6.897501e-11,0.445431,0.586308,1.385781e-08,0.072384,-0.021884,-0.051657,0.000061,-0.007252,...,0.026133,0.041835,0.006899,0.069943,0.032852,0.016389,-0.008876,-0.021480,-0.021429,0.012605


## Latent Semantic Analysis

## `TF-IDF` + `SVD` = LSA
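
A minimal LSA sketch on a made-up corpus, chaining the same two steps as the pipeline above. The documents, the two components and `random_state=0` are illustrative choices:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

docs = [
    'mieszkanie balkon winda metro',
    'mieszkanie balkon piwnica metro',
    'dom ogród garaż taras',
    'dom ogród garaż las',
]

lsa = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('svd', TruncatedSVD(n_components=2, random_state=0)),
])

embedded = lsa.fit_transform(docs)  # dense array: one row per document, one column per component

print(embedded.shape)
print(lsa.named_steps['svd'].explained_variance_ratio_)  # share of variance captured by each component
```

Each document is now described by 2 dense numbers instead of 10 sparse term weights.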

---
## Latent Dirichlet Allocation - *"Topic Extraction"*

In [31]:
from sklearn.pipeline import Pipeline
from sklearn.decomposition import LatentDirichletAllocation

pipeline = Pipeline([
                ('tfidf', TfidfVectorizer(stop_words=stop_words , min_df=2)),
                ('topics', LatentDirichletAllocation(n_components=15)),
])

decomposed = pipeline.fit_transform(nn_corpus)

df = pd.DataFrame(decomposed)
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667
1,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667
2,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667
3,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667
4,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
930,0.014219,0.014219,0.014219,0.141726,0.014219,0.014219,0.497873,0.014219,0.014219,0.189777,0.014219,0.014219,0.014219,0.014219,0.014219
931,0.012560,0.219755,0.012560,0.012560,0.012560,0.012560,0.616970,0.012560,0.012560,0.012560,0.012560,0.012560,0.012560,0.012560,0.012560
932,0.013349,0.013349,0.013349,0.013349,0.013349,0.013349,0.768371,0.013349,0.013349,0.058095,0.013349,0.013349,0.013349,0.013349,0.013349
933,0.013427,0.013427,0.013427,0.013427,0.013427,0.013427,0.812022,0.013427,0.013427,0.013427,0.013427,0.013427,0.013427,0.013427,0.013427
