# Python in Data Science Workshop

---
## Natural Language Processing - part 2 of 2

### NLP libraries
#### - Scikit-Learn
#### - NLTK

#### - Gensim
#### - Spacy



### Scikit-Learn
- A powerful machine learning library
- Includes many tools for statistical processing of text data

### NLTK
- A natural language processing library
- Includes many tools for statistical processing of text data
- Corpora
- [NLTK Book](https://www.nltk.org/book/)

### Spacy
- Many pretrained statistical language models
- Probably the best models for Polish: morphology and part-of-speech tagging

### Gensim
- A collection of recent algorithms for processing text data
- Word2Vec, CBOW, etc.

In [1]:
import warnings
warnings.filterwarnings("ignore")
import re
import pandas as pd 

# Forward slashes work on every platform and avoid invalid escape sequences such as '\g'
data = pd.read_csv('data/gumtree-2021-03-09.csv', sep='|')
columns = list(data.columns)
columns[0] = "Index"
data.columns=columns
data.set_index('Index', drop=True, inplace=True)
data.drop(["title", "url"], axis=1,inplace=True)

def no_tags(s):
    return re.sub(r'<[^<]+?>','',str(s))

data["description"] = data["description"].apply(no_tags)

In [2]:
data.head()

Index,description
0,Na sprzedaż piękna kawalerka o powierzchni 24 ...
1,"Mieszkanie dwupokojowe,własnościowe z 1971 r n..."
2,OPIS INWESTYCJI\n===============\nPOWER INVEST...
3,Bezpośrednio od dewelopera- brak prowizji 0%- ...
4,Na sprzedaż ekskluzywne mieszkanie dwupokojowe...


In [3]:
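import gzip

# Maps every inflected form to its base form.
# The first entry on each line is treated as the lemma, the rest as its forms.
dictionary = {}

with gzip.open('data/odm.txt.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        forms = [w.strip().lower() for w in line.strip().split(',')]
        for w in forms[1:]:
            dictionary[w] = forms[0]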

def lematize(w):
    return dictionary.get(w,w)
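
The lookup built above can be tried on a tiny in-memory sample. This is only a sketch: the sample lines are made up for illustration and merely mimic the `lemma, form, form, ...` layout that the loading loop assumes, and `build_lemma_dictionary` and `lemmatize_with` are hypothetical helper names.

```python
def build_lemma_dictionary(lines):
    """Map every inflected form to the first (base) entry of its line."""
    mapping = {}
    for line in lines:
        forms = [w.strip().lower() for w in line.strip().split(',')]
        for form in forms[1:]:
            mapping[form] = forms[0]
    return mapping


# Hypothetical sample lines, not taken from the real dictionary file
sample_lines = [
    "kawalerka, kawalerki, kawalerkę, kawalerką",
    "piękny, piękna, piękne, pięknego",
]
sample_dictionary = build_lemma_dictionary(sample_lines)


def lemmatize_with(mapping, word):
    # Unknown words fall back to themselves, as in lematize() above
    return mapping.get(word, word)


print(lemmatize_with(sample_dictionary, "piękna"))     # piękny
print(lemmatize_with(sample_dictionary, "kawalerkę"))  # kawalerka
print(lemmatize_with(sample_dictionary, "balkon"))     # balkon (not in the sample)
```

A plain dictionary keeps the lookup constant-time, at the price of returning unknown or misspelled words unchanged.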

In [4]:
import re

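# In Python 3, \w already matches Unicode letters (including Polish diacritics),
# so \W+ splits on every run of non-word characters
splitter = re.compile(r'\W+')
# Matches any token that contains a digit
isnumber = re.compile(r'[0-9]')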

def preprocessing(opis):
    opis = str(opis)
    
    tokenized = splitter.split(opis)
    l = list(tokenized)
    l = [ x.lower() for x in l if len(x)>2 ]
    l = [ x for x in l if isnumber.search(x) is None ]
    l = [ lematize(x) for x in l ]
    return l
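
The steps of `preprocessing` can be followed on a single phrase. The sketch below is self-contained and makes two assumptions: `toy_lemmas` is a made-up stand-in for the full lemma dictionary, and `toy_preprocessing` is a hypothetical name used so that it does not shadow the function above.

```python
import re

splitter = re.compile(r'\W+')      # split on runs of non-word characters
has_digit = re.compile(r'[0-9]')   # detects tokens that contain a digit

# Hypothetical mini lemma map standing in for the full dictionary
toy_lemmas = {"piękna": "piękny", "powierzchni": "powierzchnia"}


def toy_preprocessing(text):
    tokens = splitter.split(str(text))
    tokens = [t.lower() for t in tokens if len(t) > 2]           # drop very short tokens
    tokens = [t for t in tokens if has_digit.search(t) is None]  # drop tokens with digits
    return [toy_lemmas.get(t, t) for t in tokens]                # lemmatize known forms


result = toy_preprocessing("Na sprzedaż piękna kawalerka o powierzchni 24 m2")
print(result)
# ['sprzedaż', 'piękny', 'kawalerka', 'powierzchnia']
```

The length filter removes "Na", "o", "24" and "m2" before the digit filter runs, so the digit filter only matters for longer tokens such as years or phone numbers.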

In [5]:
data["clean_description"] = data["description"].apply(lambda x: ' '.join(preprocessing(x)))

In [6]:
data.head()

Index,description,clean_description
0,Na sprzedaż piękna kawalerka o powierzchni 24 ...,sprzedaż piękny kawalerka powierzchnia ostatni...
1,"Mieszkanie dwupokojowe,własnościowe z 1971 r n...",mieszkać dwupokojowy własnościowy pierwszy pię...
2,OPIS INWESTYCJI\n===============\nPOWER INVEST...,opis inwestycja power invest przyjemność zapre...
3,Bezpośrednio od dewelopera- brak prowizji 0%- ...,bezpośredni deweloper brak prowizja brak podat...
4,Na sprzedaż ekskluzywne mieszkanie dwupokojowe...,sprzedaż ekskluzywny mieszkać dwupokojowy powi...


In [7]:
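from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, not used below; the vectorizer is fitted on the Gumtree descriptions
corpus = {1: "The game of life is a game of everlasting learning", 
          2: "The unexamined life is not worth living", 
          3: "Never stop learning"}
# min_df=2: ignore terms that appear in fewer than two documents
tfidf = TfidfVectorizer(min_df=2)
tfs = tfidf.fit_transform(data["clean_description"])

# scikit-learn >= 1.0; older releases provide get_feature_names() instead
feature_names = tfidf.get_feature_names_out()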


In [8]:
print(list(feature_names[:50]))

['aby', 'aczkolwiek', 'adaptacja', 'adres', 'agd', 'agencja', 'agent', 'aktualność', 'aktualny', 'aktywny', 'al', 'alejka', 'aluzyjny', 'amator', 'amfiteatr', 'andrychowicznieruchomosci', 'aneks', 'antresola', 'antywłamaniowy', 'apartament', 'apartamentowiec', 'apteka', 'aranżacja', 'aranżacyjny', 'architektoniczny', 'architektura', 'arkadia', 'armatura', 'art', 'atmosfera', 'atrakcja', 'atrakcjekomunikacja', 'atrakcyjny', 'atut', 'aut', 'autobus', 'autobusowy', 'bagno', 'bajkowy', 'balance', 'balkon', 'bank', 'bar', 'bardzo', 'basen', 'baza', 'bazarek', 'bazia', 'bem', 'bemowo']


In [9]:
# scikit-learn >= 1.0; older releases provide get_feature_names() instead
df = pd.DataFrame(tfs.toarray(), columns=tfidf.get_feature_names_out())

In [10]:
df

Unnamed: 0,aby,aczkolwiek,adaptacja,adres,agd,agencja,agent,aktualność,aktualny,aktywny,...,świetny,świeży,żaden,żerać,żerań,żerańogłoszenie,życzyć,żyto,żyć,żłobek
0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.00000,0.000000,0.0,0.000000,0.0
1,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.00000,0.141596,0.0,0.000000,0.0
2,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.00000,0.000000,0.0,0.061946,0.0
3,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.00000,0.000000,0.0,0.093885,0.0
4,0.117810,0.0,0.0,0.000000,0.061704,0.0,0.0,0.0,0.059788,0.0,...,0.044939,0.0,0.0,0.000000,0.0,0.00000,0.000000,0.0,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
219,0.099201,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.100689,0.0,...,0.000000,0.0,0.0,0.308448,0.0,0.12965,0.000000,0.0,0.000000,0.0
220,0.097644,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.099108,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.00000,0.000000,0.0,0.000000,0.0
221,0.073988,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.075098,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.00000,0.000000,0.0,0.000000,0.0
222,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.00000,0.000000,0.0,0.000000,0.0
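
To see where the weights in such a table come from, the calculation can be reproduced by hand on the three-sentence toy corpus. The sketch below uses only the standard library and assumes the default `TfidfVectorizer` settings (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2-normalized rows). It should correspond to what scikit-learn computes in that configuration, but it is an illustration rather than the library's implementation. The names `tokenized`, `df_counts` and `tfidf_row` are hypothetical.

```python
import math
import re
from collections import Counter

docs = [
    "The game of life is a game of everlasting learning",
    "The unexamined life is not worth living",
    "Never stop learning",
]

# Lowercase and keep tokens of two or more word characters
tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
df_counts = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf_row(tokens):
    tf = Counter(tokens)
    # Raw count times smoothed idf: ln((1 + n) / (1 + df)) + 1
    weights = {t: c * (math.log((1 + n_docs) / (1 + df_counts[t])) + 1)
               for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}  # L2-normalize the row


rows = [tfidf_row(tokens) for tokens in tokenized]
for term, weight in sorted(rows[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:12s} {weight:.3f}")
```

In the first sentence "game" gets the highest weight: it occurs twice there and in no other document, whereas "life" and "learning" are shared with another sentence and are therefore discounted.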


---
# NLTK

In [13]:
import pprint
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# nltk.download()

![NLTK Downloader](img/nltk.png)

In [14]:
opis = data['description'][0]
opis

'Na sprzedaż piękna kawalerka o powierzchni 24 m2 na ostatnim piętrze 10 piętrowego bloku z oknem wychodzącym na spokojną stronę osiedla. Bardzo dobrze skomunikowane z centrum (tramwaje ,autobusy).W pobliżu znajduje się dobra infrastruktura: sklepy, apteka, szkoła, targowisko ( hala Banacha),oraz park szczęśliwicki (5 minut na piechotę).Mieszkanie słoneczne i bardzo ustawne ,budynek po wymianie windy i elektryki w częściach wspólnych.Serdecznie zapraszamy do kontaktu.'

In [15]:
print(word_tokenize(opis))

['Na', 'sprzedaż', 'piękna', 'kawalerka', 'o', 'powierzchni', '24', 'm2', 'na', 'ostatnim', 'piętrze', '10', 'piętrowego', 'bloku', 'z', 'oknem', 'wychodzącym', 'na', 'spokojną', 'stronę', 'osiedla', '.', 'Bardzo', 'dobrze', 'skomunikowane', 'z', 'centrum', '(', 'tramwaje', ',', 'autobusy', ')', '.W', 'pobliżu', 'znajduje', 'się', 'dobra', 'infrastruktura', ':', 'sklepy', ',', 'apteka', ',', 'szkoła', ',', 'targowisko', '(', 'hala', 'Banacha', ')', ',', 'oraz', 'park', 'szczęśliwicki', '(', '5', 'minut', 'na', 'piechotę', ')', '.Mieszkanie', 'słoneczne', 'i', 'bardzo', 'ustawne', ',', 'budynek', 'po', 'wymianie', 'windy', 'i', 'elektryki', 'w', 'częściach', 'wspólnych.Serdecznie', 'zapraszamy', 'do', 'kontaktu', '.']


In [16]:
pprint.pprint(sent_tokenize(opis))

['Na sprzedaż piękna kawalerka o powierzchni 24 m2 na ostatnim piętrze 10 '
 'piętrowego bloku z oknem wychodzącym na spokojną stronę osiedla.',
 'Bardzo dobrze skomunikowane z centrum (tramwaje ,autobusy).W pobliżu '
 'znajduje się dobra infrastruktura: sklepy, apteka, szkoła, targowisko ( hala '
 'Banacha),oraz park szczęśliwicki (5 minut na piechotę).Mieszkanie słoneczne '
 'i bardzo ustawne ,budynek po wymianie windy i elektryki w częściach '
 'wspólnych.Serdecznie zapraszamy do kontaktu.']


In [17]:
tokens = [ word_tokenize(sentence) for sentence in sent_tokenize(opis)]
for sentence in tokens:
    print(sentence)

['Na', 'sprzedaż', 'piękna', 'kawalerka', 'o', 'powierzchni', '24', 'm2', 'na', 'ostatnim', 'piętrze', '10', 'piętrowego', 'bloku', 'z', 'oknem', 'wychodzącym', 'na', 'spokojną', 'stronę', 'osiedla', '.']
['Bardzo', 'dobrze', 'skomunikowane', 'z', 'centrum', '(', 'tramwaje', ',', 'autobusy', ')', '.W', 'pobliżu', 'znajduje', 'się', 'dobra', 'infrastruktura', ':', 'sklepy', ',', 'apteka', ',', 'szkoła', ',', 'targowisko', '(', 'hala', 'Banacha', ')', ',', 'oraz', 'park', 'szczęśliwicki', '(', '5', 'minut', 'na', 'piechotę', ')', '.Mieszkanie', 'słoneczne', 'i', 'bardzo', 'ustawne', ',', 'budynek', 'po', 'wymianie', 'windy', 'i', 'elektryki', 'w', 'częściach', 'wspólnych.Serdecznie', 'zapraszamy', 'do', 'kontaktu', '.']
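
The tokens `'.W'`, `'.Mieszkanie'` and `'wspólnych.Serdecznie'` in the output above come from missing spaces after punctuation in the advertisement text. One possible pre-processing step, sketched here with the standard library only, is to insert a space after punctuation that is directly followed by a letter. The function name `add_missing_spaces` is hypothetical, and the rule is a heuristic: it would also split e-mail addresses and URLs, so it suits this kind of free text rather than arbitrary input.

```python
import re

# One of . , ; : ! ? ) that is directly followed by a letter
missing_space = re.compile(r'([.,;:!?)])(?=[^\W\d_])')


def add_missing_spaces(text):
    return missing_space.sub(r'\1 ', text)


sample = "w częściach wspólnych.Serdecznie zapraszamy (tramwaje ,autobusy).W pobliżu"
fixed = add_missing_spaces(sample)
print(fixed)
# w częściach wspólnych. Serdecznie zapraszamy (tramwaje , autobusy). W pobliżu
```

Because the lookahead requires a letter, decimal numbers such as `1.4` are left untouched.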


In [18]:
from urllib import request
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')
raw[:75]

'\ufeffThe Project Gutenberg eBook of Crime and Punishment, by Fyodor Dostoevsky\r'

In [19]:
tokens = word_tokenize(raw)
text = nltk.Text(tokens)
text.collocation_list()

[('Katerina', 'Ivanovna'),
 ('Pyotr', 'Petrovitch'),
 ('Pulcheria', 'Alexandrovna'),
 ('Avdotya', 'Romanovna'),
 ('Rodion', 'Romanovitch'),
 ('Marfa', 'Petrovna'),
 ('Sofya', 'Semyonovna'),
 ('old', 'woman'),
 ('Project', 'Gutenberg-tm'),
 ('Porfiry', 'Petrovitch'),
 ('Amalia', 'Ivanovna'),
 ('great', 'deal'),
 ('young', 'man'),
 ('Nikodim', 'Fomitch'),
 ('Project', 'Gutenberg'),
 ('Ilya', 'Petrovitch'),
 ('Andrey', 'Semyonovitch'),
 ('Hay', 'Market'),
 ('Dmitri', 'Prokofitch'),
 ('Good', 'heavens')]

In [20]:
# 'carroll-alice.txt' is the entry at index 7 of nltk.corpus.gutenberg.fileids()
alice = 'carroll-alice.txt'
al = nltk.corpus.gutenberg.words(alice)
al_text = nltk.Text(al)
al_text.collocation_list(25)

[('Mock', 'Turtle'),
 ('said', 'Alice'),
 ('March', 'Hare'),
 ('White', 'Rabbit'),
 ('thought', 'Alice'),
 ('golden', 'key'),
 ('beautiful', 'Soup'),
 ('white', 'kid'),
 ('good', 'deal'),
 ('kid', 'gloves'),
 ('Mary', 'Ann'),
 ('yer', 'honour'),
 ('three', 'gardeners'),
 ('play', 'croquet'),
 ('Lobster', 'Quadrille'),
 ('ootiful', 'Soo'),
 ('great', 'hurry'),
 ('old', 'fellow'),
 ('trembling', 'voice'),
 ('poor', 'little'),
 ('next', 'witness'),
 ('feet', 'high'),
 ('poor', 'Alice'),
 ('inches', 'high'),
 ('young', 'lady')]
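
The idea behind a collocation list can be sketched without NLTK: count adjacent word pairs and rank them by how much more often they occur together than their individual frequencies would suggest. The example below uses pointwise mutual information (PMI) on a made-up token list. This is a simplification chosen for illustration and is not claimed to be the scoring that `collocation_list` applies.

```python
import math
from collections import Counter

# Made-up text, used only to illustrate the counting
tokens = ("the mock turtle sighed and the mock turtle sang "
          "while the white rabbit ran and the white rabbit hid").split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
total = len(tokens)


def pmi(pair):
    """Pointwise mutual information of a bigram, in bits."""
    w1, w2 = pair
    p_pair = bigrams[pair] / (total - 1)
    return math.log2(p_pair / ((unigrams[w1] / total) * (unigrams[w2] / total)))


# Keep bigrams seen at least twice, then rank them by PMI
candidates = [pair for pair, count in bigrams.items() if count >= 2]
ranked = sorted(candidates, key=pmi, reverse=True)
print(ranked)
```

The two-word names rank above pairs that start with the frequent word "the", because PMI penalizes pairs whose members are common on their own.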

!pip install regex

In [21]:
import nltk

opis="Ala ma kota, kto tam przyszedł"

tc = nltk.classify.textcat.TextCat() 
tc.guess_language(opis)


'pol'

---
# Spacy

In [22]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text, token.pos_, token.dep_)

Apple PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.K. PROPN dobj
startup VERB dep
for ADP prep
$ SYM quantmod
1 NUM compound
billion NUM pobj


In [23]:
import spacy
from spacy.lang.pl.examples import sentences 

nlp = spacy.load("pl_core_news_sm")
doc = nlp(sentences[0])
print(doc.text)


Poczuł przyjemną woń mocnej kawy.


In [24]:
for token in doc:
    print(token.text, token.pos_, token.dep_)

Poczuł VERB ROOT
przyjemną ADJ amod
woń NOUN obj
mocnej ADJ amod
kawy NOUN nmod:arg
. PUNCT punct


In [25]:
doc = nlp(opis)
print(doc.text)

Ala ma kota, kto tam przyszedł


In [26]:
for token in doc:
    print(token.text, token.pos_, token.dep_)

Ala PROPN nsubj
ma VERB ROOT
kota NOUN iobj
, PUNCT punct
kto PRON nsubj
tam ADV advmod
przyszedł VERB acl:relcl


---
# Gensim

## Word2Vec - a word vector model based on (shallow) neural networks, although it is commonly filed under Deep Learning

### vec("king") - vec("man") + vec("woman") ≈ vec("queen")
---

In [27]:
import gensim.downloader as api
# Pre-trained GloVe vectors from gensim-data (not Word2Vec, but the same vector arithmetic applies)
word_vectors = api.load("glove-wiki-gigaword-100")


In [28]:
word_vectors.most_similar(positive=['woman', 'king'], negative=['man'])[0]

('queen', 0.7698541283607483)

In [29]:
word_vectors.most_similar(positive=['woman', 'king'], negative=['man'])

[('queen', 0.7698541283607483),
 ('monarch', 0.6843380928039551),
 ('throne', 0.6755736470222473),
 ('daughter', 0.6594556570053101),
 ('princess', 0.6520534157752991),
 ('prince', 0.6517035365104675),
 ('elizabeth', 0.6464517712593079),
 ('mother', 0.6311717629432678),
 ('emperor', 0.6106470823287964),
 ('wife', 0.6098655462265015)]
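
The arithmetic behind this query can be illustrated with hand-made vectors. In the sketch below the three dimensions and all values in `toy_vectors` are invented for the example, and `analogy` is a hypothetical, simplified stand-in for `most_similar`: it adds and subtracts the raw vectors and ranks the remaining words by cosine similarity.

```python
import math

# Made-up 3-dimensional vectors: (royalty, maleness, femaleness)
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(positive, negative):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(toy_vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, toy_vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, toy_vectors[word])]
    exclude = set(positive) | set(negative)  # query words are not candidates
    scores = [(w, cosine(target, v)) for w, v in toy_vectors.items() if w not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


print(analogy(positive=["woman", "king"], negative=["man"])[0][0])  # queen
```

Subtracting "man" cancels the maleness component of "king", and adding "woman" supplies the femaleness component, which leaves a vector that points almost exactly at "queen".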

---

## Factors that determine the success of topic extraction

1. Quality of the preprocessing
2. Diversity of the data
3. Then nothing for a long, long time ...
4. Choice of algorithm
5. Algorithm parameterization

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html

In [30]:
import warnings
warnings.filterwarnings('ignore')

`import nltk` 

`nltk.download('stopwords')`

`nltk.download('wordnet')`

In [31]:
import gzip
import pandas as pd

# The context manager closes the gzip file once the data is loaded
with gzip.open('data/newsgroups.json.gz', 'rt', encoding='utf-8') as f:
    df = pd.read_json(f)

df.head()

Unnamed: 0,content,target,target_names
0,From: lerxst@wam.umd.edu (where's my thing)\nS...,7,rec.autos
1,From: guykuo@carson.u.washington.edu (Guy Kuo)...,4,comp.sys.mac.hardware
2,From: twillis@ec.ecn.purdue.edu (Thomas E Will...,4,comp.sys.mac.hardware
3,From: jgreen@amber (Joe Green)\nSubject: Re: W...,1,comp.graphics
4,From: jcm@head-cfa.harvard.edu (Jonathan McDow...,14,sci.space


In [32]:
df.target_names.unique()

array(['rec.autos', 'comp.sys.mac.hardware', 'comp.graphics', 'sci.space',
       'talk.politics.guns', 'sci.med', 'comp.sys.ibm.pc.hardware',
       'comp.os.ms-windows.misc', 'rec.motorcycles', 'talk.religion.misc',
       'misc.forsale', 'alt.atheism', 'sci.electronics', 'comp.windows.x',
       'rec.sport.hockey', 'rec.sport.baseball', 'soc.religion.christian',
       'talk.politics.mideast', 'talk.politics.misc', 'sci.crypt'],
      dtype=object)

In [33]:
example = df['content'][1]
example


"From: guykuo@carson.u.washington.edu (Guy Kuo)\nSubject: SI Clock Poll - Final Call\nSummary: Final call for SI clock reports\nKeywords: SI,acceleration,clock,upgrade\nArticle-I.D.: shelley.1qvfo9INNc3s\nOrganization: University of Washington\nLines: 11\nNNTP-Posting-Host: carson.u.washington.edu\n\nA fair number of brave souls who upgraded their SI clock oscillator have\nshared their experiences for this poll. Please send a brief message detailing\nyour experiences with the procedure. Top speed attained, CPU rated speed,\nadd on cards and adapters, heat sinks, hour of usage per day, floppy disk\nfunctionality with 800 and 1.4 m floppies are especially requested.\n\nI will be summarizing in the next two days, so please add to the network\nknowledge base if you have done the clock upgrade and haven't answered this\npoll. Thanks.\n\nGuy Kuo <guykuo@u.washington.edu>\n"

In [34]:
# Plain Python list with the text of every post
raw_data = df['content'].tolist()

In [35]:
raw_data[1]

"From: guykuo@carson.u.washington.edu (Guy Kuo)\nSubject: SI Clock Poll - Final Call\nSummary: Final call for SI clock reports\nKeywords: SI,acceleration,clock,upgrade\nArticle-I.D.: shelley.1qvfo9INNc3s\nOrganization: University of Washington\nLines: 11\nNNTP-Posting-Host: carson.u.washington.edu\n\nA fair number of brave souls who upgraded their SI clock oscillator have\nshared their experiences for this poll. Please send a brief message detailing\nyour experiences with the procedure. Top speed attained, CPU rated speed,\nadd on cards and adapters, heat sinks, hour of usage per day, floppy disk\nfunctionality with 800 and 1.4 m floppies are especially requested.\n\nI will be summarizing in the next two days, so please add to the network\nknowledge base if you have done the clock upgrade and haven't answered this\npoll. Thanks.\n\nGuy Kuo <guykuo@u.washington.edu>\n"

In [36]:
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
len(stop_words)

179

In [37]:
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [38]:
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

In [39]:
import nltk
from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer

wn_lemmatizer = WordNetLemmatizer()
    
def lemmatize(word):
    return wn_lemmatizer.lemmatize(word)

In [40]:
lemmatize('automata')

'automaton'

In [41]:
from spacy.lang.en import English

# A blank English pipeline is enough here: only its tokenizer is used
parser = English()

def tokenize(text):
    result = []
    tokens = parser(text)
    for token in tokens:
        if token.orth_.isspace():
            continue
        elif token.like_url:
            result.append('URL')
        elif token.orth_.startswith('@'):
            result.append('SCREEN_NAME')
        else:
            result.append(token.lower_)
    return result

In [42]:
print(tokenize(example))

['from', ':', 'guykuo@carson.u.washington.edu', '(', 'guy', 'kuo', ')', 'subject', ':', 'si', 'clock', 'poll', '-', 'final', 'call', 'summary', ':', 'final', 'call', 'for', 'si', 'clock', 'reports', 'keywords', ':', 'si', ',', 'acceleration', ',', 'clock', ',', 'upgrade', 'article', '-', 'i.d.', ':', 'shelley.1qvfo9innc3s', 'organization', ':', 'university', 'of', 'washington', 'lines', ':', '11', 'nntp', '-', 'posting', '-', 'host', ':', 'URL', 'a', 'fair', 'number', 'of', 'brave', 'souls', 'who', 'upgraded', 'their', 'si', 'clock', 'oscillator', 'have', 'shared', 'their', 'experiences', 'for', 'this', 'poll', '.', 'please', 'send', 'a', 'brief', 'message', 'detailing', 'your', 'experiences', 'with', 'the', 'procedure', '.', 'top', 'speed', 'attained', ',', 'cpu', 'rated', 'speed', ',', 'add', 'on', 'cards', 'and', 'adapters', ',', 'heat', 'sinks', ',', 'hour', 'of', 'usage', 'per', 'day', ',', 'floppy', 'disk', 'functionality', 'with', '800', 'and', '1.4', 'm', 'floppies', 'are', 'es

In [43]:
def preprocessing(text):
    tokens = tokenize(text)
    tokens = [token for token in tokens if len(token) > 4]
    tokens = [token for token in tokens if len(token) < 14]
    tokens = [token for token in tokens if token not in stop_words]
    tokens = [lemmatize(token) for token in tokens]
    return tokens

In [44]:
print(preprocessing(example))

['clock', 'final', 'summary', 'final', 'clock', 'report', 'keywords', 'acceleration', 'clock', 'upgrade', 'article', 'organization', 'university', 'washington', 'line', 'posting', 'number', 'brave', 'soul', 'upgraded', 'clock', 'oscillator', 'shared', 'experience', 'please', 'brief', 'message', 'detailing', 'experience', 'procedure', 'speed', 'attained', 'rated', 'speed', 'card', 'adapter', 'sink', 'usage', 'floppy', 'functionality', 'floppy', 'especially', 'requested', 'summarizing', 'please', 'network', 'knowledge', 'clock', 'upgrade', 'answered', 'thanks']


```python

# Stored as a Markdown cell because this takes a while; change it to Code to run it

text_data = [ preprocessing(text) for text in raw_data ]

```

---
## Pickle everything you can!
---

```python

# Stored as a Markdown cell because this takes a while; change it to Code to run it

import pickle

with open('data/text_data.pkl', 'wb') as f:
    pickle.dump(text_data, f)

```

In [45]:
import pickle

# Load the preprocessed documents saved earlier
with open('data/text_data.pkl', 'rb') as f:
    text_data = pickle.load(f)

In [46]:
print(text_data[1])

['clock', 'final', 'summary', 'final', 'clock', 'report', 'keywords', 'acceleration', 'clock', 'upgrade', 'article', 'organization', 'university', 'washington', 'line', 'posting', 'number', 'brave', 'soul', 'upgraded', 'clock', 'oscillator', 'shared', 'experience', 'please', 'brief', 'message', 'detailing', 'experience', 'procedure', 'speed', 'attained', 'rated', 'speed', 'card', 'adapter', 'sink', 'usage', 'floppy', 'functionality', 'floppy', 'especially', 'requested', 'summarizing', 'please', 'network', 'knowledge', 'clock', 'upgrade', 'answered', 'thanks']


In [47]:
import pickle
from gensim import corpora

dictionary = corpora.Dictionary(text_data)
corpus = [dictionary.doc2bow(text) for text in text_data]

# corpora.MmCorpus.serialize('data/corpus.mm', (x for x in corpus if len(x) > 0))
# pickle.dump(corpus, open('data/corpus.pkl', 'wb'))
# dictionary.save('data/dictionary.gensim')
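
What `corpora.Dictionary` and `doc2bow` produce can be imitated in a few lines of plain Python: every distinct token gets an integer id, and each document becomes a sorted list of `(token_id, count)` pairs. This is a sketch of the data structure only. The names `token2id` handling and `to_bow` below are hypothetical helpers, and ids are assigned here by first appearance, which need not match the numbering gensim chooses.

```python
from collections import Counter

toy_texts = [
    ["clock", "final", "clock", "upgrade"],
    ["floppy", "clock", "floppy"],
]

# Assign an integer id to every distinct token, in order of first appearance
token2id = {}
for text in toy_texts:
    for token in text:
        token2id.setdefault(token, len(token2id))


def to_bow(tokens):
    """Return sorted (token_id, count) pairs; tokens without an id are skipped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


toy_corpus = [to_bow(text) for text in toy_texts]
print(token2id)                      # {'clock': 0, 'final': 1, 'upgrade': 2, 'floppy': 3}
print(toy_corpus)                    # [[(0, 2), (1, 1), (2, 1)], [(0, 1), (3, 2)]]
print(to_bow(["clock", "unknown"]))  # [(0, 1)]
```

Word order is discarded in this representation, which is why it is called a bag of words; a document made only of unknown tokens becomes an empty list.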

In [48]:
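# Print the 1-based position of every document that ended up empty after preprocessing
for n, bow in enumerate(corpus, start=1):
    if len(bow) == 0:
        print(n)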

In [49]:
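import pickle
from gensim import corpora

with open('data/corpus.pkl', 'rb') as f:
    corpus = pickle.load(f)
# A dictionary written with dictionary.save() is read back with Dictionary.load()
dictionary = corpora.Dictionary.load('data/dictionary.gensim')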

In [50]:
df.target_names.unique()

array(['rec.autos', 'comp.sys.mac.hardware', 'comp.graphics', 'sci.space',
       'talk.politics.guns', 'sci.med', 'comp.sys.ibm.pc.hardware',
       'comp.os.ms-windows.misc', 'rec.motorcycles', 'talk.religion.misc',
       'misc.forsale', 'alt.atheism', 'sci.electronics', 'comp.windows.x',
       'rec.sport.hockey', 'rec.sport.baseball', 'soc.religion.christian',
       'talk.politics.mideast', 'talk.politics.misc', 'sci.crypt'],
      dtype=object)

In [51]:
import gensim
NUM_TOPICS = 5

# Single-core LdaModel; gensim.models.LdaMulticore is the parallel variant
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=NUM_TOPICS, id2word=dictionary, passes=5)
ldamodel.save('data/model5.gensim')
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)

(0, '0.006*"would" + 0.006*"people" + 0.005*"organization" + 0.005*"article"')
(1, '0.007*"would" + 0.007*"people" + 0.006*"line" + 0.006*"christian"')
(2, '0.010*"would" + 0.009*"writes" + 0.009*"organization" + 0.009*"line"')
(3, '0.014*"line" + 0.014*"organization" + 0.010*"would" + 0.010*"writes"')
(4, '0.017*"line" + 0.016*"organization" + 0.010*"window" + 0.009*"drive"')


In [52]:
topics = ldamodel.print_topics(num_words=8)
for topic in topics:
    print(topic)

(0, '0.006*"would" + 0.006*"people" + 0.005*"organization" + 0.005*"article" + 0.005*"line" + 0.005*"space" + 0.005*"state" + 0.004*"writes"')
(1, '0.007*"would" + 0.007*"people" + 0.006*"line" + 0.006*"christian" + 0.005*"organization" + 0.005*"jesus" + 0.004*"think" + 0.004*"writes"')
(2, '0.010*"would" + 0.009*"writes" + 0.009*"organization" + 0.009*"line" + 0.008*"article" + 0.007*"people" + 0.006*"think" + 0.005*"university"')
(3, '0.014*"line" + 0.014*"organization" + 0.010*"would" + 0.010*"writes" + 0.009*"article" + 0.008*"posting" + 0.007*"system" + 0.005*"could"')
(4, '0.017*"line" + 0.016*"organization" + 0.010*"window" + 0.009*"drive" + 0.008*"university" + 0.008*"posting" + 0.007*"problem" + 0.006*"system"')
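
In the output above, words such as "line", "organization", "would" and "writes" appear in almost every topic, which makes the topics hard to tell apart. One common remedy, sketched here in plain Python, is to drop tokens that occur in too large a share of the documents before training. The documents, the function name `filter_by_document_frequency` and the `max_fraction` threshold are all assumptions made for the illustration. gensim's `Dictionary.filter_extremes` offers a comparable filter through its `no_below` and `no_above` parameters.

```python
from collections import Counter

# Made-up documents that imitate the header words dominating the topics
toy_docs = [
    ["line", "organization", "window", "drive"],
    ["line", "organization", "jesus", "christian"],
    ["line", "organization", "space", "orbit"],
    ["line", "writes", "window", "server"],
]


def filter_by_document_frequency(docs, max_fraction=0.5):
    """Drop tokens that appear in more than max_fraction of the documents."""
    doc_freq = Counter(token for doc in docs for token in set(doc))
    limit = max_fraction * len(docs)
    return [[t for t in doc if doc_freq[t] <= limit] for doc in docs]


filtered_docs = filter_by_document_frequency(toy_docs)
for doc in filtered_docs:
    print(doc)
```

With four documents and a threshold of 0.5, "line" (4 documents) and "organization" (3 documents) are removed, while "window" (2 documents) is kept.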



```python

# Change this cell from Markdown to Code to run it

import gensim
NUM_TOPICS = 20

# Single-core LdaModel; gensim.models.LdaMulticore is the parallel variant
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=NUM_TOPICS, id2word=dictionary, passes=15)
ldamodel.save('data/model20.gensim')
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)
    
```

In [53]:
import gensim

ldamodel = gensim.models.ldamodel.LdaModel.load('data/model20.gensim')

In [54]:
topics = ldamodel.print_topics(num_words=8)
for topic in topics:
    print(topic)
    

(0, '0.034*"entry" + 0.013*"program" + 0.011*"rule" + 0.010*"section" + 0.010*"outlet" + 0.007*"neutral" + 0.007*"build" + 0.007*"output"')
(1, '0.026*"window" + 0.016*"program" + 0.013*"image" + 0.010*"available" + 0.010*"version" + 0.010*"server" + 0.010*"file" + 0.009*"application"')
(2, '0.016*"circuit" + 0.015*"power" + 0.013*"point" + 0.012*"ground" + 0.011*"radio" + 0.011*"input" + 0.011*"battery" + 0.010*"signal"')
(3, '0.020*"armenian" + 0.012*"turkish" + 0.012*"people" + 0.008*"greek" + 0.007*"woman" + 0.006*"turkey" + 0.006*"turk" + 0.006*"armenia"')
(4, '0.035*"president" + 0.024*"clinton" + 0.015*"koresh" + 0.011*"going" + 0.009*"today" + 0.008*"house" + 0.008*"package" + 0.008*"press"')
(5, '0.017*"disease" + 0.012*"patient" + 0.011*"cause" + 0.010*"doctor" + 0.009*"medical" + 0.008*"safety" + 0.008*"effect" + 0.007*"thing"')
(6, '0.028*"space" + 0.015*"SCREEN_NAME" + 0.009*"center" + 0.008*"earth" + 0.008*"orbit" + 0.008*"april" + 0.008*"research" + 0.007*"satellite"')
(

In [55]:
df.target_names.unique()

array(['rec.autos', 'comp.sys.mac.hardware', 'comp.graphics', 'sci.space',
       'talk.politics.guns', 'sci.med', 'comp.sys.ibm.pc.hardware',
       'comp.os.ms-windows.misc', 'rec.motorcycles', 'talk.religion.misc',
       'misc.forsale', 'alt.atheism', 'sci.electronics', 'comp.windows.x',
       'rec.sport.hockey', 'rec.sport.baseball', 'soc.religion.christian',
       'talk.politics.mideast', 'talk.politics.misc', 'sci.crypt'],
      dtype=object)

In [56]:
new_doc = 'Data science is a new technology that uses statistics and computer science'
new_doc = preprocessing(new_doc)
new_doc_bow = dictionary.doc2bow(new_doc)

print(ldamodel.get_document_topics(new_doc_bow))

[(15, 0.84159905)]


In [57]:
rez = ldamodel.get_document_topics(new_doc_bow)

In [58]:
rez.sort(key=lambda x: x[1], reverse=True)
rez

[(15, 0.84159905)]

In [59]:
example

"From: guykuo@carson.u.washington.edu (Guy Kuo)\nSubject: SI Clock Poll - Final Call\nSummary: Final call for SI clock reports\nKeywords: SI,acceleration,clock,upgrade\nArticle-I.D.: shelley.1qvfo9INNc3s\nOrganization: University of Washington\nLines: 11\nNNTP-Posting-Host: carson.u.washington.edu\n\nA fair number of brave souls who upgraded their SI clock oscillator have\nshared their experiences for this poll. Please send a brief message detailing\nyour experiences with the procedure. Top speed attained, CPU rated speed,\nadd on cards and adapters, heat sinks, hour of usage per day, floppy disk\nfunctionality with 800 and 1.4 m floppies are especially requested.\n\nI will be summarizing in the next two days, so please add to the network\nknowledge base if you have done the clock upgrade and haven't answered this\npoll. Thanks.\n\nGuy Kuo <guykuo@u.washington.edu>\n"

In [60]:
example_doc = preprocessing(example)
example_bow = dictionary.doc2bow(example_doc)
rez = ldamodel.get_document_topics(example_bow)
rez.sort(key=lambda x: x[1], reverse=True)
print(rez)

[(15, 0.38689086), (10, 0.30832854), (1, 0.09367774), (17, 0.06815313), (9, 0.06580259), (19, 0.040835872), (2, 0.02355116)]
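
A list of `(topic_id, probability)` pairs like the one above can be reduced to a single label by taking the most probable topic. The helper `dominant_topic` below is a hypothetical name, and the numbers are copied from the output above so that the sketch runs without the trained model.

```python
def dominant_topic(topic_distribution):
    """Return the (topic_id, probability) pair with the highest probability, or None."""
    if not topic_distribution:
        return None
    return max(topic_distribution, key=lambda pair: pair[1])


# Values copied from the output above
rez_example = [(15, 0.38689086), (10, 0.30832854), (1, 0.09367774),
               (17, 0.06815313), (9, 0.06580259), (19, 0.040835872), (2, 0.02355116)]

topic_id, probability = dominant_topic(rez_example)
print(topic_id, round(probability, 3))  # 15 0.387

# Share of probability mass covered by the listed topics
print(round(sum(p for _, p in rez_example), 3))
```

The listed probabilities add up to slightly less than 1, which is consistent with low-probability topics being left out of the result. Topics 15 and 10 are also close enough that reporting only the top one would hide a good part of the picture.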
