# Implementation


## Using the austen-persuasion.txt text from the gutenberg corpus of the nltk library and extracting key bigrams.


### Load the austen-persuasion.txt text.


In [13]:
import nltk
import re
from nltk.corpus import gutenberg
import numpy as np
name = 'austen-persuasion.txt'
text = [' '.join(sent) for sent in gutenberg.sents(name)]
text[:30]

['[ Persuasion by Jane Austen 1818 ]',
 'Chapter 1',
 'Sir Walter Elliot , of Kellynch Hall , in Somersetshire , was a man who , for his own amusement , never took up any book but the Baronetage ; there he found occupation for an idle hour , and consolation in a distressed one ; there his faculties were roused into admiration and respect , by contemplating the limited remnant of the earliest patents ; there any unwelcome sensations , arising from domestic affairs changed naturally into pity and contempt as he turned over the almost endless creations of the last century ; and there , if every other leaf were powerless , he could read his own history with an interest which never failed .',
 'This was the page at which the favourite volume always opened :',
 '" ELLIOT OF KELLYNCH HALL .',
 '" Walter Elliot , born March 1 , 1760 , married , July 15 , 1784 , Elizabeth , daughter of James Stevenson , Esq .',
 'of South Park , in the county of Gloucester , by which lady ( who died 1800 ) he h

_Reading the text_


### Define the English stop words.


In [14]:
wpt = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('english')

_Stop words_
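`stopwords.words('english')` returns a plain list, so every membership test scans it linearly. A minimal sketch of the filtering step with a set instead, assuming a small hand-picked stop list (`STOP_WORDS` is hypothetical and much shorter than NLTK's):

```python
# Hypothetical subset of English stop words; NLTK's list is considerably longer.
STOP_WORDS = frozenset({'a', 'the', 'of', 'in', 'was', 'who', 'for', 'his'})


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not stop words, preserving their order."""
    return [token for token in tokens if token not in stop_words]


tokens = 'sir walter elliot of kellynch hall in somersetshire was a man'.split()
filtered = remove_stop_words(tokens)
print(filtered)
```

With the real list, `set(stop_words)` would give the same constant-time lookups.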


### Define a function that preprocesses a document. We apply the np.vectorize decorator so that the function can be applied to whole corpora.


In [15]:
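@np.vectorize
def preproc_doc(doc):
    # Keep only ASCII letters and whitespace. The flags must be passed by keyword:
    # the fourth positional argument of re.sub is `count`, not `flags`.
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I | re.A)
    doc = doc.lower()
    doc = doc.strip()
    # Tokenize and drop English stop words.
    tokens = wpt.tokenize(doc)
    filtered_tokens = [token for token in tokens if token not in stop_words]
    doc = ' '.join(filtered_tokens)
    return doc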


text = preproc_doc(text)
text

array(['persuasion jane austen', 'chapter',
       'sir walter elliot kellynch hall somersetshire man amusement never took book baronetage found occupation idle hour consolation distressed one faculties roused admiration respect contemplating limited remnant earliest patents unwelcome sensations arising domestic affairs changed naturally pity contempt turned almost endless creations last century every leaf powerless could read history interest never failed',
       ...,
       'profession could ever make friends wish tenderness less dread future war could dim sunshine',
       'gloried sailor wife must pay tax quick alarm belonging profession possible distinguished domestic virtues national importance',
       'finis'], dtype='<U662')

_Document preprocessing_
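One detail of `re.sub` matters for this kind of cleaning: its fourth positional parameter is `count`, so flags passed in that position act as a substitution limit. The sketch below shows the difference on a hypothetical string; `re.I | re.A` has the integer value 258, so a positional call would stop after 258 replacements.

```python
import re

# re.I | re.A is an integer flag; used as `count` it caps the number of substitutions.
flags_as_int = int(re.I | re.A)

# Hypothetical noisy input: 300 digits followed by a word.
noisy = '1' * 300 + ' word'

# Equivalent to passing the flags as the fourth positional argument.
limited = re.sub(r'[^a-zA-Z\s]', '', noisy, count=flags_as_int)

# Flags passed by keyword: every non-letter, non-space character is removed.
full = re.sub(r'[^a-zA-Z\s]', '', noisy, flags=re.I | re.A)

print(len(limited), repr(full))
```

Short documents hide the problem, since they rarely contain more than 258 characters to strip.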


### Import the tools for finding collocations and selecting those that occur most often or that score highest on other measures, such as pointwise mutual information.


In [16]:
from nltk.collocations import BigramCollocationFinder
from nltk.collocations import BigramAssocMeasures
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_documents(
    [item.split() for item in text])
finder.nbest(bigram_measures.raw_freq, 10)

[('captain', 'wentworth'),
 ('mr', 'elliot'),
 ('lady', 'russell'),
 ('sir', 'walter'),
 ('mrs', 'clay'),
 ('mrs', 'musgrove'),
 ('mrs', 'smith'),
 ('captain', 'benwick'),
 ('miss', 'elliot'),
 ('mrs', 'croft')]

_Key bigrams_
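The ranking above uses raw frequency. Pointwise mutual information instead compares how often a pair occurs with how often it would occur if the two words were independent: PMI(x, y) = log2(P(x, y) / (P(x) P(y))). A minimal pure-Python sketch on a hypothetical toy corpus follows; the probabilities are estimated directly from counts, which may differ in detail from how NLTK computes the measure.

```python
import math
from collections import Counter


def bigram_pmi(docs):
    """Return {bigram: PMI} for adjacent word pairs inside each tokenized document."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in docs:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_words = sum(unigrams.values())
    n_bigrams = sum(bigrams.values())
    scores = {}
    for (first, second), count in bigrams.items():
        p_xy = count / n_bigrams
        p_x = unigrams[first] / n_words
        p_y = unigrams[second] / n_words
        scores[(first, second)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Hypothetical toy corpus, already tokenized.
toy_docs = [
    ['captain', 'wentworth', 'came'],
    ['captain', 'wentworth', 'left'],
    ['captain', 'came', 'back'],
]
pmi_scores = bigram_pmi(toy_docs)
for pair in sorted(pmi_scores, key=pmi_scores.get, reverse=True):
    print(pair, round(pmi_scores[pair], 3))
```

In the toy corpus the pairs that occur once outrank the more frequent `('captain', 'wentworth')`, which illustrates why PMI is usually combined with a frequency filter. With the `finder` built above, `finder.apply_freq_filter(...)` followed by `finder.nbest(bigram_measures.pmi, 10)` would give the PMI-based ranking.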


## Applying latent semantic indexing from the Gensim library to topic modeling.


### First, import the modules and read the file.


In [17]:
import pandas as pd
df = pd.read_csv('bbc-news-data.csv', sep='\t')
df

| | category | filename | title | content |
|---|---|---|---|---|
| 0 | business | 001.txt | Ad sales boost Time Warner profit | Quarterly profits at US media giant TimeWarne... |
| 1 | business | 002.txt | Dollar gains on Greenspan speech | The dollar has hit its highest level against ... |
| 2 | business | 003.txt | Yukos unit buyer faces loan claim | The owners of embattled Russian oil giant Yuk... |
| 3 | business | 004.txt | High fuel prices hit BA's profits | British Airways has blamed high fuel prices f... |
| 4 | business | 005.txt | Pernod takeover talk lifts Domecq | Shares in UK drinks and food firm Allied Dome... |
| ... | ... | ... | ... | ... |
| 2220 | tech | 397.txt | BT program to beat dialler scams | BT is introducing two initiatives to help bea... |
| 2221 | tech | 398.txt | Spam e-mails tempt net shoppers | Computer users across the world continue to i... |
| 2222 | tech | 399.txt | Be careful how you code | A new European directive could put software w... |
| 2223 | tech | 400.txt | US cyber security chief resigns | The man making sure US computer networks are ... |


_Reading the file_


### Select only the "content" column.


In [18]:
text = df['content'].values
text

array([' Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.  The firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales. TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.  Time Warner said on Friday that it now owns 8% of search-engine Google. But its own internet business, AOL, had has mixed fortunes. It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters. However, the company said AOL\'s underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues. It hopes to increase subscribers by offering the online service free to TimeWarner internet customers and will try to sign up AOL\'s existing customers fo

_Text extraction_


### Normalize the text.


In [19]:
text = preproc_doc(text)
text

array(['quarterly profits us media giant timewarner jumped bn three months december yearearlier firm one biggest investors google benefited sales highspeed internet connections higher advert sales timewarner said fourth quarter sales rose bn bn profits buoyed oneoff gains offset profit dip warner bros less users aol time warner said friday owns searchengine google internet business aol mixed fortunes lost subscribers fourth quarter profits lower preceding three quarters however company said aols underlying profit exceptional items rose back stronger internet advertising revenues hopes increase subscribers offering online service free timewarner internet customers try sign aols existing customers highspeed broadband timewarner also restate results following probe us securities exchange commission sec close concluding time warners fourth quarter profits slightly better analysts expectations film division saw profits slump helped boxoffice flops alexander catwoman sharp contrast yearearli

_Text normalization_
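The normalized output contains glued tokens such as `highspeed`, `yearearlier` and `oneoff`, because hyphens are deleted rather than replaced by a space. The sketch below compares the two choices on a hypothetical sentence; `strip_non_letters` is an illustrative helper, not part of the notebook, and whether the spaced variant is preferable depends on the task.

```python
import re


def strip_non_letters(doc, replacement):
    """Replace every character that is not an ASCII letter or whitespace, then lowercase."""
    cleaned = re.sub(r'[^a-zA-Z\s]', replacement, doc)
    # Collapse runs of whitespace produced by the replacement.
    return ' '.join(cleaned.lower().split())


sample = 'High-speed internet, year-earlier profits'
glued = strip_non_letters(sample, '')    # deletion, as in the notebook
spaced = strip_non_letters(sample, ' ')  # replacement by a space
print(glued)
print(spaced)
```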


### Split each document into a list of words.


In [20]:
sentences = [sent.split() for sent in text]
sentences[:10]

[['quarterly',
  'profits',
  'us',
  'media',
  'giant',
  'timewarner',
  'jumped',
  'bn',
  'three',
  'months',
  'december',
  'yearearlier',
  'firm',
  'one',
  'biggest',
  'investors',
  'google',
  'benefited',
  'sales',
  'highspeed',
  'internet',
  'connections',
  'higher',
  'advert',
  'sales',
  'timewarner',
  'said',
  'fourth',
  'quarter',
  'sales',
  'rose',
  'bn',
  'bn',
  'profits',
  'buoyed',
  'oneoff',
  'gains',
  'offset',
  'profit',
  'dip',
  'warner',
  'bros',
  'less',
  'users',
  'aol',
  'time',
  'warner',
  'said',
  'friday',
  'owns',
  'searchengine',
  'google',
  'internet',
  'business',
  'aol',
  'mixed',
  'fortunes',
  'lost',
  'subscribers',
  'fourth',
  'quarter',
  'profits',
  'lower',
  'preceding',
  'three',
  'quarters',
  'however',
  'company',
  'said',
  'aols',
  'underlying',
  'profit',
  'exceptional',
  'items',
  'rose',
  'back',
  'stronger',
  'internet',
  'advertising',
  'revenues',
  'hopes',
  'increase

_Sentences_


### Train a model that detects frequent bigrams.


In [21]:
from gensim.models.phrases import Phrases, Phraser, ENGLISH_CONNECTOR_WORDS
bigram = Phrases(sentences, min_count=20, threshold=20,
                 connector_words=ENGLISH_CONNECTOR_WORDS)
bigram_model = Phraser(bigram)

_Model_
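The `min_count` and `threshold` arguments control which word pairs become phrases. A minimal sketch of the default scoring rule as documented for gensim, score = (count(a, b) - min_count) * N / (count(a) * count(b)), is shown below. It is an illustration under assumptions: the toy corpus is hypothetical, N is taken as the number of distinct words, connector words are ignored, and gensim's internal bookkeeping may differ.

```python
from collections import Counter


def phrase_scores(docs, min_count):
    """Score every adjacent word pair; higher scores indicate stronger collocations."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in docs:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)  # assumption: N = number of distinct words
    return {
        pair: (count - min_count) * vocab_size / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, count in bigrams.items()
    }


# Hypothetical toy corpus; the notebook uses min_count=20 and threshold=20 on real data.
toy_docs = [
    ['fourth', 'quarter', 'profits'],
    ['fourth', 'quarter', 'sales'],
    ['fourth', 'quarter', 'rose'],
    ['profits', 'rose', 'sales'],
]
scores = phrase_scores(toy_docs, min_count=1)
accepted = sorted(pair for pair, score in scores.items() if score > 1.0)
print(accepted)
```

Pairs that occur no more than `min_count` times get a score of zero or less, so only sufficiently frequent and mutually specific pairs pass the threshold.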


### Extract bigrams for all documents and create the dictionary.


In [22]:
from gensim.corpora import Dictionary
norm_corpus_bigrams = [bigram_model[sent] for sent in sentences]
dictionary = Dictionary(norm_corpus_bigrams)
norm_corpus_bigrams[:20]

[['quarterly',
  'profits',
  'us',
  'media',
  'giant',
  'timewarner',
  'jumped',
  'bn',
  'three_months',
  'december',
  'yearearlier',
  'firm',
  'one',
  'biggest',
  'investors',
  'google',
  'benefited',
  'sales',
  'highspeed',
  'internet',
  'connections',
  'higher',
  'advert',
  'sales',
  'timewarner',
  'said',
  'fourth_quarter',
  'sales',
  'rose',
  'bn_bn',
  'profits',
  'buoyed',
  'oneoff',
  'gains',
  'offset',
  'profit',
  'dip',
  'warner',
  'bros',
  'less',
  'users',
  'aol',
  'time',
  'warner',
  'said',
  'friday',
  'owns',
  'searchengine',
  'google',
  'internet',
  'business',
  'aol',
  'mixed',
  'fortunes',
  'lost',
  'subscribers',
  'fourth_quarter',
  'profits',
  'lower',
  'preceding',
  'three',
  'quarters',
  'however',
  'company',
  'said',
  'aols',
  'underlying',
  'profit',
  'exceptional',
  'items',
  'rose',
  'back',
  'stronger',
  'internet',
  'advertising',
  'revenues',
  'hopes',
  'increase',
  'subscribers',


_Document bigrams_
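Applying the trained model joins each accepted pair into a single token with an underscore. The sketch below imitates that step with a greedy left-to-right pass; `merge_bigrams` and the set of accepted pairs are hypothetical stand-ins for the trained model, not gensim's implementation.

```python
def merge_bigrams(tokens, accepted, delimiter='_'):
    """Join adjacent tokens that form an accepted pair, scanning left to right."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in accepted:
            merged.append(tokens[i] + delimiter + tokens[i + 1])
            i += 2  # both tokens are consumed by the phrase
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical set of pairs that passed the score threshold.
accepted_pairs = {('bn', 'bn'), ('fourth', 'quarter')}
tokens = ['sales', 'rose', 'bn', 'bn', 'profits', 'fourth', 'quarter']
merged = merge_bigrams(tokens, accepted_pairs)
print(merged)
```

Tokens such as `bn_bn` in the output come from amounts like "$11.1bn from $10.9bn": once digits, punctuation and stop words are stripped, only `bn bn` remains.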


### Reduce the size of the dictionary, which contains many unique rare words, and build the bag-of-words model.


In [23]:
dictionary.filter_extremes(no_below=20, no_above=0.6)
bow_corpus = [dictionary.doc2bow(doc) for doc in norm_corpus_bigrams]
bow_corpus[:20]

[[(0, 2),
  (1, 2),
  (2, 1),
  (3, 1),
  (4, 2),
  (5, 1),
  (6, 1),
  (7, 1),
  (8, 1),
  (9, 1),
  (10, 1),
  (11, 1),
  (12, 3),
  (13, 1),
  (14, 1),
  (15, 1),
  (16, 1),
  (17, 1),
  (18, 1),
  (19, 1),
  (20, 1),
  (21, 1),
  (22, 1),
  (23, 2),
  (24, 1),
  (25, 1),
  (26, 2),
  (27, 2),
  (28, 1),
  (29, 1),
  (30, 1),
  (31, 1),
  (32, 1),
  (33, 2),
  (34, 1),
  (35, 1),
  (36, 1),
  (37, 2),
  (38, 1),
  (39, 1),
  (40, 1),
  (41, 1),
  (42, 1),
  (43, 3),
  (44, 1),
  (45, 1),
  (46, 1),
  (47, 1),
  (48, 1),
  (49, 2),
  (50, 1),
  (51, 1),
  (52, 1),
  (53, 2),
  (54, 2),
  (55, 1),
  (56, 1),
  (57, 1),
  (58, 1),
  (59, 4),
  (60, 1),
  (61, 1),
  (62, 1),
  (63, 1),
  (64, 1),
  (65, 1),
  (66, 1),
  (67, 1),
  (68, 1),
  (69, 1),
  (70, 1),
  (71, 1),
  (72, 1),
  (73, 1),
  (74, 1),
  (75, 1),
  (76, 1),
  (77, 1),
  (78, 1),
  (79, 1),
  (80, 1),
  (81, 1),
  (82, 1),
  (83, 1),
  (84, 2),
  (85, 1),
  (86, 1),
  (87, 4),
  (88, 5),
  (89, 1),
  (90, 1),
  (91, 1)

_Bag of words_
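The two steps above can be sketched in pure Python: keep only tokens whose document frequency lies between `no_below` documents and a `no_above` fraction of the corpus, then map each document to sorted `(token_id, count)` pairs. The helpers and the toy corpus are hypothetical, and ids are assigned alphabetically here, which is an assumption and not how gensim numbers its tokens.

```python
from collections import Counter


def build_vocab(docs, no_below, no_above):
    """Map each kept token to an id, filtering by document frequency."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each token once per document
    max_docs = no_above * len(docs)
    kept = sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)
    return {token: idx for idx, token in enumerate(kept)}


def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens outside the vocabulary."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Hypothetical toy corpus.
toy_docs = [
    ['said', 'music', 'award', 'music'],
    ['said', 'music', 'game'],
    ['said', 'game', 'award', 'rare'],
    ['said', 'music', 'game', 'award'],
]
token2id = build_vocab(toy_docs, no_below=2, no_above=0.75)
bow = to_bow(toy_docs[0], token2id)
print(token2id, bow)
```

Here `rare` is dropped for appearing in a single document and `said` for appearing in all of them, mirroring the roles of `no_below` and `no_above`.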


### Apply latent semantic indexing.


In [24]:
from gensim.models import LsiModel
total_topics = 10
lsi_bow = LsiModel(bow_corpus, id2word=dictionary,
                   num_topics=total_topics,
                   onepass=True, chunksize=10000,
                   power_iters=1000)

_Latent semantic indexing_
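LSI factorizes the term-document matrix with a truncated singular value decomposition, and each topic is a singular vector over the terms. The sketch below approximates only the first topic by plain power iteration on a hypothetical 4x4 count matrix. gensim relies on its own incremental SVD routine, so this illustrates the idea and not the library's algorithm.

```python
import math


def leading_topic(matrix, iterations=100):
    """Approximate the leading left singular vector of a term-document matrix."""
    n_terms, n_docs = len(matrix), len(matrix[0])
    u = [1.0] * n_terms
    for _ in range(iterations):
        # v = A^T u: score every document against the current term weights.
        v = [sum(matrix[t][d] * u[t] for t in range(n_terms)) for d in range(n_docs)]
        # u = A v: map the document scores back to term space.
        u = [sum(matrix[t][d] * v[d] for d in range(n_docs)) for t in range(n_terms)]
        norm = math.sqrt(sum(x * x for x in u))
        u = [x / norm for x in u]
    return u


# Hypothetical term-document counts (rows are terms, columns are documents).
terms = ['music', 'song', 'government', 'labour']
counts = [
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
topic = leading_topic(counts)
for term, weight in zip(terms, topic):
    print(term, round(weight, 3))
```

The dominant direction loads equally on `music` and `song` and not at all on the political terms, because the first two documents carry most of the weight in the matrix.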


### View the main topics.


In [25]:
for topic_id, topic in lsi_bow.print_topics(num_topics=10, num_words=20):
    print('Topic #'+str(topic_id+1)+':')
    print(topic)

Topic #1:
0.266*"would" + 0.254*"people" + 0.198*"mr" + 0.183*"also" + 0.172*"one" + 0.171*"new" + 0.161*"us" + 0.153*"could" + 0.125*"music" + 0.122*"government" + 0.117*"like" + 0.107*"time" + 0.100*"get" + 0.096*"many" + 0.093*"first" + 0.090*"make" + 0.090*"year" + 0.085*"two" + 0.083*"uk" + 0.082*"way"
Topic #2:
-0.447*"music" + 0.262*"mr" + 0.228*"would" + -0.216*"best" + 0.207*"government" + -0.160*"game" + -0.137*"song" + 0.131*"labour" + -0.114*"awards" + -0.107*"games" + -0.092*"win" + -0.091*"award" + -0.090*"good" + -0.088*"like" + -0.088*"film" + -0.088*"think" + -0.087*"last" + -0.086*"play" + 0.082*"bn" + 0.082*"plans"
Topic #3:
0.337*"music" + 0.303*"people" + -0.210*"game" + 0.158*"technology" + -0.134*"win" + -0.133*"england" + -0.125*"wales" + 0.120*"users" + 0.120*"mobile" + -0.116*"first" + -0.104*"two" + 0.102*"services" + 0.102*"use" + 0.101*"digital" + -0.098*"back" + 0.094*"net" + -0.093*"best" + 0.089*"tv" + 0.088*"software" + 0.087*"broadband"
Topic #4:
0.425

_Main topics_
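Each printed topic mixes positively and negatively weighted terms, and the two groups usually point at opposing themes. A small sketch for splitting a topic string by sign is given below; `parse_topic` is a hypothetical helper, and the input is a shortened copy of Topic #2 from the output above.

```python
import re


def parse_topic(topic):
    """Turn a string like '-0.447*"music" + 0.262*"mr"' into (weight, term) pairs."""
    pairs = re.findall(r'(-?\d+\.\d+)\*"([^"]+)"', topic)
    return [(float(weight), term) for weight, term in pairs]


# First five terms of Topic #2 as printed above.
topic_2 = '-0.447*"music" + 0.262*"mr" + 0.228*"would" + -0.216*"best" + 0.207*"government"'
weights = parse_topic(topic_2)
positive = [term for weight, term in weights if weight > 0]
negative = [term for weight, term in weights if weight < 0]
print('positive:', positive)
print('negative:', negative)
```

For Topic #2 the positive side gathers political vocabulary and the negative side entertainment vocabulary. The overall sign of a topic is arbitrary, so only the contrast between the two groups is meaningful. `lsi_bow.show_topics(formatted=False)` returns the weights as tuples directly, which avoids parsing strings.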
