# Введение в обработку текста на естественном языке

Материалы:
* Макрушин С.В. Лекция 9: Введение в обработку текста на естественном языке\
* https://realpython.com/nltk-nlp-python/
* https://scikit-learn.org/stable/modules/feature_extraction.html

## Задачи для совместного разбора

In [187]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.stem import SnowballStemmer, WordNetLemmatizer
from nltk.metrics.distance import edit_distance
from nltk import sent_tokenize, word_tokenize
from nltk.probability import FreqDist
from scipy.spatial import distance
from nltk.corpus import stopwords
import pandas as pd
import numpy as np
import pymorphy2
import random

In [102]:
import nltk
nltk.download('wordnet')
nltk.download("stopwords")

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Артем\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Артем\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [34]:
path1 = "D:/FinUniver/Технологии обработки больших данных/Семинары/09_string_2/09_string_2_data/"
path2 = "D:/FinUniver/Технологии обработки больших данных/Семинары/08_string/data/"

1. Считайте слова из файла `litw-win.txt` и запишите их в список `words`. В заданном предложении исправьте все опечатки, заменив слова с опечатками на ближайшие (в смысле расстояния Левенштейна) к ним слова из списка `words`. Считайте, что в слове есть опечатка, если данное слово не содержится в списке `words`. 

In [4]:
s1 = 'ПИ19-4'
s2 = 'ПИ19-3'
edit_distance(s1, s2)

1

In [6]:
with open(path1 + 'litw-win.txt') as f:
    words = [line.strip().split()[-1] for line in f]
words[-5:]

['высокопревосходительства',
 'попреблагорассмотрительст',
 'попреблагорассмотрительствующемуся',
 'убегающих',
 'уменьшившейся']

In [7]:
text = '''с велечайшим усилием выбравшись из потока убегающих людей Кутузов со свитой уменьшевшейся вдвое поехал на звуки выстрелов русских орудий'''

In [8]:
word = 'велечайшим'

In [9]:
min(words, key=lambda k: edit_distance(k, word))

'величайшим'

2. Разбейте текст из формулировки задания 1 на слова; проведите стемминг и лемматизацию слов.

In [16]:
stemmer = SnowballStemmer('russian')
word = 'попреблагорассмотрительствующимся'
stemmer.stem(word)

'попреблагорассмотрительств'

In [19]:
morph = pymorphy2.MorphAnalyzer()
morph.parse(word)[0].normalized.word

'попреблагорассмотрительствующийся'

3. Преобразуйте предложения из формулировки задания 1 в векторы при помощи `CountVectorizer`.

In [26]:
text = '''Считайте слова из файла `litw-win.txt` и запишите их в список `words`. В заданном предложении исправьте все опечатки, заменив слова с опечатками на ближайшие (в смысле расстояния Левенштейна) к ним слова из списка `words`. Считайте, что в слове есть опечатка, если данное слово не содержится в списке `words`.'''
sents = sent_tokenize(text)
sents

['Считайте слова из файла `litw-win.txt` и запишите их в список `words`.',
 'В заданном предложении исправьте все опечатки, заменив слова с опечатками на ближайшие (в смысле расстояния Левенштейна) к ним слова из списка `words`.',
 'Считайте, что в слове есть опечатка, если данное слово не содержится в списке `words`.']

In [29]:
cv = CountVectorizer()
cv.fit(sents)
sents_cv = cv.transform(sents).toarray()
sents_cv

array([[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0],
       [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1,
        1, 1, 2, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,
        0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1]], dtype=int64)

In [30]:
sents_cv.shape

(3, 35)

In [32]:
cv.vocabulary_

{'считайте': 32,
 'слова': 24,
 'из': 12,
 'файла': 33,
 'litw': 0,
 'win': 2,
 'txt': 1,
 'запишите': 11,
 'их': 14,
 'список': 31,
 'words': 3,
 'заданном': 9,
 'предложении': 22,
 'исправьте': 13,
 'все': 5,
 'опечатки': 21,
 'заменив': 10,
 'опечатками': 20,
 'на': 16,
 'ближайшие': 4,
 'смысле': 27,
 'расстояния': 23,
 'левенштейна': 15,
 'ним': 18,
 'списка': 29,
 'что': 34,
 'слове': 25,
 'есть': 8,
 'опечатка': 19,
 'если': 7,
 'данное': 6,
 'слово': 26,
 'не': 17,
 'содержится': 28,
 'списке': 30}

## Лабораторная работа 9

### Расстояние редактирования

1.1 Загрузите предобработанные описания рецептов из файла `preprocessed_descriptions.csv`. Получите набор уникальных слов `words`, содержащихся в текстах описаний рецептов (воспользуйтесь `word_tokenize` из `nltk`). 

In [51]:
recipes_descriptions = pd.read_csv(path2 + 'preprocessed_descriptions.csv', sep=';')
recipes_descriptions

Unnamed: 0,name,description
0,george s at the cove black bean soup,an original recipe created by chef scott meska...
1,healthy for them yogurt popsicles,my children and their friends ask for my homem...
2,i can t believe it s spinach,these were so go it surprised even me
3,italian gut busters,my sisterinlaw made these for us at a family g...
4,love is in the air beef fondue sauces,i think a fondue is a very romantic casual din...
5,mennonite corn fritters,ok my heritage has been revealed these are s...
6,open sesame noodles,this is a very versatile and widely enjoyed pa...
7,say what banana sandwich,you just have to try it to believe it
8,1 in canada chocolate chip cookies,this is the recipe that we use at my school ca...
9,412 broccoli casserole,since there are already 411 recipes for brocco...


In [53]:
words = set()

for index, row in recipes_descriptions.iterrows():
    try:
        for word in word_tokenize(row[1]):
            words.add(word)
    except TypeError:
        continue
words

{'wurstladende',
 'lead',
 'casein',
 'braai',
 'nutsfruits',
 'linzie',
 'antiozidants',
 'litte',
 'sausageherd',
 'sauceginger',
 'kirk',
 'naidre',
 'slimmers',
 'andalusia',
 'bodies',
 'baileys',
 'suggestionsdrizzle',
 'hulled',
 'pita',
 'webgaza',
 'halfbatch',
 'wrongnote',
 'mutigrain',
 'riverfront',
 'munitap',
 'housesi',
 'parsleybulgur',
 'recipei',
 'indians',
 'consultant',
 'sayssimple',
 'elsewhere',
 'pot',
 'baltimore',
 'copy',
 'admittedly',
 'balmy',
 'ebay',
 'ninja',
 'imposter',
 'libbie',
 'guitarist',
 'on',
 'molinillo',
 'tendency',
 'spreads',
 'emerils',
 'extramoist',
 'lunchin',
 'poo',
 'wwwrwoodcom',
 'south',
 'sorted',
 'lucre',
 'hertzog',
 'minnetonka',
 'likedand',
 'bonne',
 '5ingredient',
 'pacon',
 'stricly',
 'everything',
 'tastemore',
 'mans',
 'teifi',
 'savvy',
 'zurie',
 'juicier',
 'boiled',
 '124',
 'minifie',
 'sweettangy',
 'vast',
 'squeeze',
 'removedbut',
 'crockpotting',
 'oregano',
 'ste',
 'prieta',
 'tamale',
 'tortelloni',

1.2 Сгенерируйте 5 пар случайно выбранных слов и посчитайте между ними расстояние редактирования.

In [64]:
group1 = random.sample(words, 5)
group2 = random.sample(words, 5)

In [65]:
for i in range(len(group1)):
    print(edit_distance(group1[i], group2[i]))

5
7
11
11
13


1.3 Напишите функцию, которая для заданного слова `word` возвращает `k` ближайших к нему слов из списка `words` (близость слов измеряется с помощью расстояния Левенштейна)

In [80]:
def close_word(word, k, words):
    words_tuple = list()
    for wor in words:
        words_tuple.append(tuple([edit_distance(word, wor), wor]))
    words_tuple.sort(key = lambda x: x[0])
    
    result = list()
    for wor in words_tuple[:k]:
        result.append(wor[1])
    
    return result

In [81]:
close_word("delicious", 4, words)

['delicious', 'delicioius', 'delicius', 'deliciousi']

### Стемминг, лемматизация

2.1 На основе результатов 1.1 создайте `pd.DataFrame` со столбцами: 
    * word
    * stemmed_word 
    * normalized_word 

Столбец `word` укажите в качестве индекса. 

Для стемминга воспользуйтесь `SnowballStemmer`, для лемматизации слов - `WordNetLemmatizer`. Сравните результаты стемминга и лемматизации.

In [88]:
stem_norm_words = pd.DataFrame(columns=['word', 'stemmed_word', 'normalized_word'])
stem_norm_words['word'] = list(words)
stem_norm_words = stem_norm_words.set_index('word')
stem_norm_words

Unnamed: 0_level_0,stemmed_word,normalized_word
word,Unnamed: 1_level_1,Unnamed: 2_level_1
wurstladende,,
lead,,
casein,,
braai,,
nutsfruits,,
linzie,,
antiozidants,,
litte,,
sausageherd,,
sauceginger,,


In [97]:
stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

In [103]:
for index in stem_norm_words.index:
    stem_norm_words['stemmed_word'][index] = stemmer.stem(index)
    stem_norm_words['normalized_word'][index] = lemmatizer.lemmatize(index)

In [104]:
stem_norm_words

Unnamed: 0_level_0,stemmed_word,normalized_word
word,Unnamed: 1_level_1,Unnamed: 2_level_1
wurstladende,wurstladend,wurstladende
lead,lead,lead
casein,casein,casein
braai,braai,braai
nutsfruits,nutsfruit,nutsfruits
linzie,linzi,linzie
antiozidants,antiozid,antiozidants
litte,litt,litte
sausageherd,sausageherd,sausageherd
sauceginger,sauceging,sauceginger


In [105]:
for index, row in stem_norm_words.iterrows():
    print(edit_distance(row[0], row[1]))

1
0
0
0
1
1
4
1
0
2
0
1
1
0
1
0
1
2
0
0
0
1
0
0
0
0
0
0
0
3
1
1
0
1
1
5
1
0
0
2
1
0
0
0
1
0
1
0
0
0
0
0
2
1
0
0
0
1
3
0
2
3
1
0
0
1
1
0
2
0
1
1
0
1
0
4
0
0
0
1
0
0
2
2
0
3
1
0
1
2
0
1
0
0
1
2
3
0
0
0
3
0
3
0
2
2
0
0
0
0
2
2
2
0
2
4
0
0
0
0
0
0
3
3
0
0
2
4
0
0
1
0
1
1
2
3
0
2
0
0
0
0
0
3
5
1
1
0
0
0
2
0
0
0
2
0
1
5
2
1
0
0
0
4
0
2
0
1
0
0
0
0
1
2
1
0
1
0
1
0
1
0
0
1
0
0
0
0
0
3
0
0
1
0
0
0
1
0
3
3
0
0
0
1
0
0
0
0
3
2
1
0
4
0
1
0
1
0
0
1
0
0
1
0
2
0
0
0
0
0
1
0
0
0
0
1
3
1
0
1
4
0
1
3
0
0
0
1
0
1
0
1
1
1
0
0
0
4
0
2
0
0
1
0
0
1
2
1
0
0
0
1
0
1
0
0
0
0
0
1
1
0
1
0
1
0
3
0
0
1
2
2
1
5
0
1
3
0
1
0
2
0
5
4
1
3
3
0
0
1
3
0
0
0
0
1
2
1
0
0
0
0
0
2
0
0
0
0
0
4
0
5
0
0
3
5
1
0
3
0
0
0
0
2
0
0
0
5
0
0
1
0
0
1
0
3
0
0
0
0
0
0
0
0
2
0
0
1
0
0
0
3
0
0
0
0
4
0
0
0
0
0
4
0
0
2
1
3
3
1
3
0
1
2
0
0
3
5
2
1
3
3
1
1
0
1
0
2
0
0
0
0
0
1
3
0
0
0
0
1
1
3
0
1
6
3
3
1
3
0
4
0
0
3
2
0
0
0
0
0
3
2
0
1
0
1
0
2
0
3
0
1
1
1
0
0
0
2
0
0
0
0
1
0
0
0
0
0
1
0
0
0
0
0
0
0
0
1
0
2
2
1
1
3
1
0
0
0
0
0
1
0
3
0
0
0
0
1
0
0


0
1
0
0
1
0
0
3
0
0
1
0
0
1
3
0
3
0
0
0
0
5
2
0
5
3
0
3
1
0
3
0
1
0
0
3
0
0
0
0
1
0
1
6
0
0
0
2
2
1
1
0
5
2
0
1
0
0
0
3
2
0
0
0
0
0
3
0
2
0
0
0
1
0
1
1
4
1
0
0
0
0
1
3
0
0
4
1
2
0
0
0
0
1
0
0
2
1
3
0
0
0
0
0
0
1
0
0
0
0
0
2
0
1
0
1
1
1
1
0
2
0
1
0
0
0
1
0
0
0
4
0
2
0
0
1
3
0
0
1
3
0
1
3
1
0
0
1
3
0
1
0
0
2
1
0
1
0
1
0
0
0
0
0
0
3
1
1
0
3
0
2
0
0
0
0
3
2
0
2
0
0
0
1
1
1
0
2
0
0
0
0
3
2
3
0
0
0
0
0
0
0
1
2
0
2
1
0
4
0
0
0
0
0
0
0
0
0
0
0
0
2
2
2
0
1
1
0
1
0
0
0
0
0
2
1
2
0
0
0
2
0
1
0
7
0
0
0
0
0
0
4
2
0
0
0
0
0
0
0
1
0
0
0
0
3
1
1
0
0
3
0
0
1
2
0
0
1
0
0
0
0
0
2
0
0
2
1
1
0
0
2
5
2
0
0
0
0
0
0
2
0
1
0
0
1
1
1
0
0
2
1
0
3
3
0
0
0
0
2
7
0
0
2
0
3
2
1
4
0
0
3
1
0
1
0
2
0
2
0
2
0
4
0
0
2
1
0
0
5
1
0
0
1
0
0
1
1
2
2
0
3
4
0
0
0
0
0
2
1
0
0
1
0
0
3
1
0
1
0
1
0
0
0
0
1
0
1
0
0
0
0
2
0
1
0
0
0
0
2
3
1
0
0
4
0
1
0
0
1
0
1
1
4
0
0
0
0
2
0
0
2
0
2
1
4
0
0
3
0
2
0
1
2
2
2
5
2
0
2
0
1
0
2
0
0
0
2
0
0
0
0
0
0
1
0
0
3
0
0
1
1
3
1
0
0
0
2
5
0
5
0
0
1
0
0
0
1
0
1
3
2
0
0
0
1
0
0
0
0
0
0
0
2
0
0
1
3
2
1


0
0
1
0
2
2
0
4
4
0
0
3
0
3
4
1
0
4
0
0
0
0
0
1
1
0
0
0
0
0
0
0
1
0
1
5
0
2
1
2
0
0
1
0
2
0
1
0
3
1
0
5
0
0
0
1
4
1
0
0
0
0
0
0
2
0
0
0
0
5
0
0
0
1
0
1
0
1
4
0
0
2
1
0
0
0
0
0
0
0
3
0
0
1
1
0
0
1
1
1
0
0
0
0
4
3
2
1
0
1
0
0
0
2
1
0
2
4
1
1
0
0
3
2
0
1
0
1
0
0
3
0
1
4
1
0
1
3
2
1
1
0
1
1
0
0
0
1
2
0
3
3
0
0
2
0
1
0
1
0
0
1
0
0
0
1
3
0
0
2
3
3
2
0
2
0
1
1
0
0
2
2
0
1
1
0
0
0
5
1
2
0
0
0
0
2
1
2
0
0
2
0
4
5
0
0
0
0
3
0
0
0
2
4
0
0
0
0
1
0
0
3
1
1
0
1
0
0
0
0
4
0
0
1
2
0
0
1
0
0
1
2
5
0
0
1
1
1
0
0
0
0
0
0
1
4
1
2
0
3
0
1
0
3
1
2
0
0
0
1
0
0
0
2
1
0
2
0
0
1
1
0
2
0
0
2
0
1
0
4
0
2
0
1
0
0
2
0
2
0
0
1
0
2
1
1
2
0
0
1
0
2
0
2
0
0
3
1
3
0
0
3
3
0
3
1
0
0
0
3
1
0
0
1
1
0
0
0
0
0
1
0
2
0
1
0
0
0
1
0
2
0
4
0
5
0
1
1
3
3
5
0
0
1
0
0
0
0
0
1
0
0
0
0
1
4
3
2
3
0
2
2
1
4
0
0
0
0
1
1
0
0
0
1
2
2
0
0
2
0
0
2
1
0
0
3
0
1
0
2
2
0
2
0
0
1
0
3
1
0
0
0
3
0
1
0
3
0
0
0
3
0
0
0
3
1
0
0
0
2
0
0
1
2
2
0
0
0
0
3
0
0
1
1
3
0
0
0
1
1
0
1
1
0
0
0
2
0
1
0
1
3
5
0
1
1
0
0
0
0
1
0
0
0
4
0
0
1
0
1
1
0
1
0
5
2
1
0
0
1


1
0
0
0
0
1
2
3
3
0
0
1
0
0
0
5
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
3
0
0
2
1
0
0
2
0
0
3
0
2
0
1
1
2
0
0
3
0
1
0
2
0
0
3
1
0
2
2
2
0
0
0
1
0
0
3
0
2
3
0
2
1
0
0
0
0
1
0
0
0
0
2
0
4
2
2
1
0
0
1
3
0
1
0
0
1
1
3
1
1
0
4
0
0
0
3
0
1
0
0
0
0
2
0
0
0
1
1
1
0
2
1
1
5
0
3
1
0
0
1
1
0
4
4
2
3
0
0
0
1
2
0
1
2
1
0
0
0
1
0
0
1
0
0
0
0
0
2
3
2
2
3
0
0
0
0
0
0
0
0
0
3
0
0
0
1
0
0
0
0
3
3
0
0
0
2
4
0
1
5
0
1
0
3
1
1
1
0
1
1
0
0
0
0
0
0
1
0
0
0
0
0
0
3
0
0
1
1
0
1
1
0
1
0
2
0
0
2
3
3
1
0
0
0
3
3
0
3
1
1
0
0
3
1
0
0
0
1
1
3
0
0
0
2
1
0
0
0
1
1
1
0
0
0
3
0
3
0
1
0
0
0
0
0
2
1
0
1
1
0
1
0
0
0
0
3
2
0
0
0
2
0
0
0
0
0
0
1
0
3
1
0
0
0
0
3
1
0
1
4
1
1
0
1
0
2
1
0
0
0
0
1
1
1
0
0
0
0
0
1
1
0
0
1
1
1
1
3
0
0
0
0
0
3
3
3
1
4
3
0
0
0
2
0
0
1
1
1
0
3
2
0
2
5
2
0
0
0
0
0
3
0
1
0
0
0
0
0
0
0
0
1
5
0
2
0
1
2
3
0
0
0
0
5
4
0
1
4
0
3
3
0
2
0
0
1
2
0
1
0
3
0
0
0
0
2
2
0
1
2
0
2
2
0
0
1
0
3
1
0
0
0
1
1
0
0
0
0
0
3
3
0
0
0
0
0
2
0
1
0
0
1
0
0
1
2
1
2
0
0
0
2
0
0
0
1
0
2
0
0
0
2
0
0
0
0
0
0
1
0
1
2
2
0
1
1
4
0
0
2
1
0
2
2
0


0
1
0
2
1
3
0
2
0
5
0
3
0
0
0
0
0
0
2
0
0
0
0
0
1
0
2
0
0
3
3
0
0
1
1
0
1
3
3
0
0
2
1
1
3
0
0
3
0
3
1
2
0
0
4
2
0
2
0
0
3
0
1
2
2
0
2
2
0
0
0
0
4
0
1
4
0
1
0
0
3
0
3
0
0
1
0
0
1
0
1
1
0
0
0
3
0
1
3
1
0
0
0
1
0
0
0
0
0
3
0
0
0
1
1
1
0
0
1
3
0
3
0
4
1
1
0
3
0
1
1
4
0
1
1
4
2
1
3
1
0
0
1
3
0
0
0
0
1
0
5
0
0
0
1
3
3
2
0
3
0
1
1
0
1
4
2
0
5
1
1
0
2
0
0
3
1
2
0
0
1
0
0
0
0
1
1
0
3
0
0
0
0
3
1
0
1
0
1
0
1
0
1
0
0
0
0
3
0
0
0
0
0
0
0
0
0
0
0
1
2
2
1
1
3
0
0
0
3
1
0
0
3
3
0
0
4
3
0
0
0
0
1
1
0
2
1
2
0
2
2
1
0
0
0
0
1
0
1
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
3
0
1
0
0
3
0
0
0
1
0
0
5
0
0
0
0
0
0
0
2
1
1
3
0
1
0
0
0
0
0
0
0
0
0
1
3
0
1
0
5
2
0
0
1
0
3
1
0
1
0
0
0
0
3
0
0
0
0
0
0
0
1
1
0
4
3
3
0
0
2
1
1
2
1
0
0
0
0
0
1
0
0
0
2
0
1
0
1
1
1
0
0
2
1
0
0
0
2
2
0
2
3
2
3
0
2
2
3
3
0
2
0
0
0
2
0
0
0
3
0
2
0
0
0
0
0
1
0
1
2
1
1
3
1
1
0
0
0
1
0
0
3
0
0
2
0
0
0
0
3
3
1
0
1
0
0
5
2
0
2
1
4
0
0
0
0
0
1
0
3
4
0
1
1
1
1
0
0
0
1
0
0
0
2
1
1
2
1
0
0
1
2
2
3
3
1
0
0
1
3
0
0
0
0
4
3
3
0
0
0
2
0
0
2
0
2
1
2
0
0
0
1
0
1


0
0
0
1
0
1
1
1
0
0
1
0
2
5
0
0
0
2
0
0
0
3
0
0
1
1
0
0
0
0
3
0
0
0
1
0
0
0
0
1
0
1
0
1
1
0
2
2
1
0
0
0
5
0
0
1
2
1
0
2
1
0
0
2
0
0
2
0
0
0
1
0
1
1
2
3
0
3
0
1
2
0
2
0
1
2
0
3
0
0
1
2
0
0
0
0
1
1
0
0
2
1
0
1
0
0
0
0
1
0
1
0
0
1
0
0
1
1
0
0
1
1
1
1
0
1
3
5
2
2
0
0
3
0
0
0
1
0
0
2
4
2
1
0
3
0
0
0
0
0
1
0
0
1
0
5
1
0
0
2
0
1
4
0
1
1
0
0
1
0
0
0
0
0
0
1
0
1
0
3
2
0
2
1
0
0
0
0
2
2
0
1
2
1
2
1
0
0
0
0
0
0
1
0
1
0
2
4
3
0
3
3
1
0
5
1
0
1
0
1
0
0
0
0
0
1
0
0
0
0
3
0
1
1
1
0
0
0
3
0
3
3
2
1
1
0
2
0
2
0
2
4
2
2
4
1
0
1
0
0
1
0
3
2
0
0
1
5
3
0
2
1
3
0
1
0
0
0
2
2
0
3
2
2
4
0
0
4
0
0
0
0
0
0
0
1
4
0
1
0
0
3
0
3
3
1
0
0
3
0
1
0
4
0
0
0
4
0
1
0
2
0
0
1
1
1
3
0
0
0
0
0
1
0
0
0
0
0
0
1
0
1
0
5
0
0
3
5
0
1
0
3
2
1
1
1
0
0
0
0
0
3
0
0
0
0
0
1
2
1
0
0
0
2
0
1
0
3
0
0
0
1
0
3
3
2
0
2
1
1
0
0
1
2
0
0
0
0
0
0
0
3
0
0
1
0
0
2
1
0
0
1
0
1
0
0
0
0
0
0
3
3
3
1
0
0
0
0
0
1
0
1
1
0
0
2
1
0
0
0
0
0
3
0
0
0
0
0
2
2
0
0
0
1
0
1
0
0
1
2
0
3
3
0
4
0
0
0
0
0
0
3
0
2
1
1
1
0
0
2
0
3
0
2
0
0
4
1
0
0
1
0
0
1
1
2
0
1
4
0


0
2
2
0
3
0
2
0
1
1
1
0
0
0
0
0
2
0
0
3
0
1
3
3
0
1
1
0
3
1
0
0
0
1
0
0
4
0
4
2
0
3
3
3
0
3
2
0
0
0
0
0
3
0
1
0
0
2
0
0
1
0
1
0
2
2
2
3
0
1
0
2
1
1
0
0
0
2
1
0
0
2
1
1
0
1
2
1
0
2
0
0
4
1
0
3
0
0
0
0
0
3
1
0
2
0
1
0
0
0
0
0
0
1
0
1
0
2
0
1
0
0
1
1
2
0
0
0
2
0
0
0
0
0
0
1
1
0
0
0
1
3
0
0
1
0
2
1
0
1
2
0
0
0
2
0
0
2
2
1
0
0
0
4
1
4
1
0
0
3
0
0
0
0
0
2
1
0
1
1
3
0
0
0
0
0
2
0
0
0
0
3
0
0
0
1
0
0
0
2
3
1
0
1
5
0
0
0
2
0
4
1
0
0
1
1
1
0
0
0
0
0
1
0
0
0
0
0
4
0
0
1
0
0
0
4
3
0
0
1
0
0
2
2
2
0
0
0
3
0
3
1
5
1
0
0
2
0
1
1
0
0
0
0
0
3
1
0
1
4
0
0
2
3
0
0
2
0
0
0
0
0
1
1
0
1
1
4
0
2
0
0
1
0
1
2
0
1
0
1
0
4
1
0
1
2
1
5
0
0
0
1
0
0
1
0
3
1
2
3
0
0
0
0
2
0
0
0
1
0
1
1
3
0
1
2
3
2
3
0
0
2
0
2
0
0
0
0
2
2
0
0
0
0
0
0
0
4
3
1
0
1
0
0
0
0
0
3
0
0
2
0
2
0
0
1
0
0
0
2
0
0
1
0
1
1
0
1
2
2
3
0
1
0
1
0
0
0
0
0
0
0
0
1
1
5
0
0
0
2
0
0
1
0
1
2
0
1
1
0
0
0
1
0
0
2
0
0
3
5
0
0
0
3
0
3
0
0
0
1
0
0
0
0
0
5
3
0
1
2
0
2
0
1
1
1
0
0
3
0
0
0
1
0
1
2
0
0
2
0
2
0
4
0
2
0
0
0
2
3
0
1
3
0
1
0
0
0
3
0
1
1
0
0
0
0
0
0
0
2


2.2. Удалите стоп-слова из описаний рецептов. Какую долю об общего количества слов составляли стоп-слова? Сравните топ-10 самых часто употребляемых слов до и после удаления стоп-слов.

In [106]:
stop_words = set(stopwords.words("english"))
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [167]:
buf_recipes = pd.DataFrame(columns=['description', 'filtered_description'])
buf_recipes['description'] = recipes_descriptions["description"]
buf_recipes['filtered_description'] = [list() for _ in range(len(buf_recipes['description']))] 
buf_recipes

Unnamed: 0,description,filtered_description
0,an original recipe created by chef scott meska...,[]
1,my children and their friends ask for my homem...,[]
2,these were so go it surprised even me,[]
3,my sisterinlaw made these for us at a family g...,[]
4,i think a fondue is a very romantic casual din...,[]
5,ok my heritage has been revealed these are s...,[]
6,this is a very versatile and widely enjoyed pa...,[]
7,you just have to try it to believe it,[]
8,this is the recipe that we use at my school ca...,[]
9,since there are already 411 recipes for brocco...,[]


In [179]:
common_counts = 0
filtered_counts = 0
for index, row in buf_recipes.iterrows():
    try:
        for word in word_tokenize(row['description']):
            common_counts += len(word)
            if word.casefold() not in stop_words:
                filtered_counts += len(word)
                buf_recipes['filtered_description'][index].append(word)
        buf_recipes['filtered_description'][index] = " ".join(buf_recipes['filtered_description'][index])
    except TypeError:
        continue
buf_recipes

Unnamed: 0,description,filtered_description
0,an original recipe created by chef scott meska...,original recipe created chef scott meskan geor...
1,my children and their friends ask for my homem...,children friends ask homemade popsicles mornin...
2,these were so go it surprised even me,go surprised even go surprised even go surpris...
3,my sisterinlaw made these for us at a family g...,sisterinlaw made us family get together delici...
4,i think a fondue is a very romantic casual din...,think fondue romantic casual dinner wonderful ...
5,ok my heritage has been revealed these are s...,ok heritage revealed simply wonderful favorite...
6,this is a very versatile and widely enjoyed pa...,versatile widely enjoyed pasta dish chicken as...
7,you just have to try it to believe it,try believe try believe try believe try believ...
8,this is the recipe that we use at my school ca...,recipe use school cafeteria chocolate chip coo...
9,since there are already 411 recipes for brocco...,since already 411 recipes broccoli casserole p...


In [180]:
print(common_counts)
print(filtered_counts)

4652350
3321299


In [181]:
print(f'Доля стоп-слов: {(common_counts - filtered_counts) / common_counts}')

Доля стоп-слов: 0.28610293722527325


In [184]:
common_text = ""
filtered_text = ""
for index, row in buf_recipes.iterrows():
    try:
        common_text += row['description'] + ' '
        filtered_text += row['filtered_description'] + ' '
    except TypeError:
        continue

In [185]:
FreqDist(nltk.Text(word_tokenize(common_text))).most_common(10)

[('the', 40210),
 ('a', 34994),
 ('and', 30279),
 ('this', 27048),
 ('i', 25111),
 ('to', 23499),
 ('is', 20290),
 ('it', 19863),
 ('of', 18372),
 ('for', 15988)]

In [186]:
FreqDist(nltk.Text(word_tokenize(filtered_text))).most_common(10)

[('recipe', 74785),
 ('make', 31765),
 ('time', 25900),
 ('use', 23175),
 ('great', 22265),
 ('like', 20875),
 ('easy', 20875),
 ('one', 19430),
 ('good', 19100),
 ('made', 19070)]

### Векторное представление текста

3.1 Выберите случайным образом 5 рецептов из набора данных. Представьте описание каждого рецепта в виде числового вектора при помощи `TfidfVectorizer`

In [241]:
random_recipes = recipes_descriptions.sample(5)
random_recipes["description"] = random_recipes["description"].map(sent_tokenize)
random_recipes

Unnamed: 0,name,description
12459,granny s fast and easy chili,[i have been making our chili this way for yea...
22328,red lentil and vegetable stew,[wonderful stew can be a vegetarian meal on it...
5410,cherry jello cookies gift mix in a jar,[gift jar directions at bottom of the recipe ...
4851,carrot banana oat bread,[so tasty and relatively good for you i eat a...
29295,white christmas pie,[found this recipe in a christmas book of the ...


In [261]:
tv = TfidfVectorizer()

recipes_text = list()
for index, row in random_recipes.iterrows():
    recipes_text.append(row['description'][0])
    
tv_fit = tv.fit_transform(recipes_text)
tv_array = tv_fit.toarray()
tv_array

array([[0.        , 0.        , 0.        , 0.14122884, 0.14122884,
        0.11394254, 0.14122884, 0.        , 0.        , 0.0672963 ,
        0.14122884, 0.        , 0.        , 0.1345926 , 0.11394254,
        0.14122884, 0.        , 0.        , 0.        , 0.        ,
        0.14122884, 0.        , 0.        , 0.14122884, 0.        ,
        0.11394254, 0.        , 0.11394254, 0.        , 0.        ,
        0.        , 0.        , 0.14122884, 0.42368652, 0.        ,
        0.        , 0.        , 0.28245768, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.0945826 , 0.        , 0.        ,
        0.        , 0.        , 0.14122884, 0.        , 0.14122884,
        0.        , 0.14122884, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.  

3.2 Вычислите близость между каждой парой рецептов, выбранных в задании 3.1, используя косинусное расстояние (`scipy.spatial.distance.cosine`) Результаты оформите в виде таблицы `pd.DataFrame`. В качестве названий строк и столбцов используйте названия рецептов.

In [263]:
cosine_table = pd.DataFrame(index=random_recipes['name'], columns=random_recipes['name'])
cosine_table

name,granny s fast and easy chili,red lentil and vegetable stew,cherry jello cookies gift mix in a jar,carrot banana oat bread,white christmas pie
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
granny s fast and easy chili,,,,,
red lentil and vegetable stew,,,,,
cherry jello cookies gift mix in a jar,,,,,
carrot banana oat bread,,,,,
white christmas pie,,,,,


In [267]:
for i in range(len(random_recipes)):
    for j in range(i, len(random_recipes)):
        cosine_table.iloc[i][j] = 1 - distance.cosine(tv_array[i], tv_array[j])
        cosine_table.iloc[j][i] = 1 - distance.cosine(tv_array[i], tv_array[j])

In [268]:
cosine_table

name,granny s fast and easy chili,red lentil and vegetable stew,cherry jello cookies gift mix in a jar,carrot banana oat bread,white christmas pie
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
granny s fast and easy chili,1.0,0.229899,0.118714,0.0987741,0.139808
red lentil and vegetable stew,0.229899,1.0,0.136756,0.107919,0.134234
cherry jello cookies gift mix in a jar,0.118714,0.136756,1.0,0.0787013,0.181482
carrot banana oat bread,0.0987741,0.107919,0.0787013,1.0,0.200332
white christmas pie,0.139808,0.134234,0.181482,0.200332,1.0


3.3 Какие рецепты являются наиболее похожими? Прокомментируйте результат (словами).

Наиболее похожими являются рецепты "granny s fast and easy chili" и "carrot banana oat bread" и рецепты "cherry jello cookies gift mix in a jar" и "carrot banana oat bread". 