# Introduction to Natural Language Text Processing

Materials:
* Makrushin S.V. Lecture 9: Introduction to Natural Language Text Processing
* https://realpython.com/nltk-nlp-python/
* https://scikit-learn.org/stable/modules/feature_extraction.html

## Problems for joint discussion

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
import pymorphy2
from nltk.metrics.distance import edit_distance

1. Read the words from the file `litw-win.txt` into a list `words`. In the given sentence, fix every typo by replacing each misspelled word with the closest word (by Levenshtein distance) from `words`. Treat a word as misspelled if it does not appear in `words`.

In [2]:
s1 = "ПИ19-3"
s2 = "ПМ19-3"
edit_distance(s1, s2)

1

In [3]:
text = '''с велечайшим усилием выбравшись из потока убегающих людей Кутузов со свитой уменьшевшейся вдвое поехал на звуки выстрелов русских орудий'''

In [4]:
word = "велечайшим"
with open("./data/litw-win.txt", "r", encoding="windows-1251") as fp:
    # Keep only the last whitespace-separated token of each non-empty line (the word itself)
    words = [line.split()[-1] for line in fp if line.strip()]
words[-5:]

['высокопревосходительства',
 'попреблагорассмотрительст',
 'попреблагорассмотрительствующемуся',
 'убегающих',
 'уменьшившейся']

In [5]:
min(words, key=lambda w: edit_distance(w, word))

'величайшим'
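
The cells above fix a single word. One way to extend this to the whole sentence, as the task asks, is sketched below. The `levenshtein` and `correct_typos` helpers and the tiny `vocabulary` are illustrative assumptions, not part of the original notebook; in the notebook itself, `edit_distance` from `nltk` and the full `words` list would take their place.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def correct_typos(sentence: str, vocabulary: list) -> str:
    """Replace every word missing from the vocabulary with its nearest neighbour."""
    known = set(vocabulary)  # set gives fast membership checks
    fixed = []
    for token in sentence.split():
        if token in known:
            fixed.append(token)
        else:
            fixed.append(min(vocabulary, key=lambda w: levenshtein(w, token)))
    return " ".join(fixed)


# Hypothetical miniature vocabulary, for illustration only
vocabulary = ["с", "величайшим", "усилием", "уменьшившейся", "вдвое"]
corrected = correct_typos("с велечайшим усилием", vocabulary)
print(corrected)
```

The lookup is case-sensitive, so if the word list is lowercase (as the sample of `words` above suggests), a capitalized token such as `Кутузов` would be treated as a typo unless it is lowercased first.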

2. Split the text from task 1 into words; apply stemming and lemmatization to them.

**Stemming strips endings and other affixes from a word; the resulting stem need not be a real word.**

In [6]:
from nltk.stem import SnowballStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

In [7]:
stemmer = SnowballStemmer('russian')

for word in word_tokenize(text):
    result = stemmer.stem(word)
    print(result)

с
велечайш
усил
выбра
из
поток
убега
люд
кутуз
со
свит
уменьшевш
вдво
поеха
на
звук
выстрел
русск
оруд


**Lemmatization is a more accurate reduction of a word to its base (dictionary) form.**

In [8]:
morph = pymorphy2.MorphAnalyzer()
for word in word_tokenize(text):
    result = morph.parse(word)[0].normalized.word
    print(result)

с
велечайший
усилие
выбраться
из
поток
убегать
человек
кутузов
с
свита
уменьшевшийся
вдвое
поехать
на
звук
выстрел
русский
орудие


3. Convert the sentences from the statement of task 1 into vectors with `CountVectorizer`.

In [9]:
text = '''Считайте слова из файла `litw-win.txt` и запишите их в список `words`. В заданном предложении исправьте все опечатки, заменив слова с опечатками на ближайшие (в смысле расстояния Левенштейна) к ним слова из списка `words`. Считайте, что в слове есть опечатка, если данное слово не содержится в списке `words`. '''
sents = sent_tokenize(text)
sents

['Считайте слова из файла `litw-win.txt` и запишите их в список `words`.',
 'В заданном предложении исправьте все опечатки, заменив слова с опечатками на ближайшие (в смысле расстояния Левенштейна) к ним слова из списка `words`.',
 'Считайте, что в слове есть опечатка, если данное слово не содержится в списке `words`.']

In [10]:
cv = CountVectorizer()
cv.fit(sents)
sents_cv = cv.transform(sents).toarray()
sents_cv

array([[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0],
       [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1,
        1, 1, 2, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,
        0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1]])

In [11]:
sents_cv.shape

(3, 35)

In [12]:
cv.vocabulary_

{'считайте': 32,
 'слова': 24,
 'из': 12,
 'файла': 33,
 'litw': 0,
 'win': 2,
 'txt': 1,
 'запишите': 11,
 'их': 14,
 'список': 31,
 'words': 3,
 'заданном': 9,
 'предложении': 22,
 'исправьте': 13,
 'все': 5,
 'опечатки': 21,
 'заменив': 10,
 'опечатками': 20,
 'на': 16,
 'ближайшие': 4,
 'смысле': 27,
 'расстояния': 23,
 'левенштейна': 15,
 'ним': 18,
 'списка': 29,
 'что': 34,
 'слове': 25,
 'есть': 8,
 'опечатка': 19,
 'если': 7,
 'данное': 6,
 'слово': 26,
 'не': 17,
 'содержится': 28,
 'списке': 30}
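
The vocabulary above contains no one-letter words such as `в`, `и` or `с`, and the indices follow the sorted order of the terms. A minimal pure-Python sketch of the same bag-of-words idea is given below. It assumes `CountVectorizer`'s default settings (lowercasing and a token pattern that keeps tokens of two or more word characters); the helper names `tokenize` and `count_vectorize` are hypothetical.

```python
import re
from collections import Counter

# Same idea as CountVectorizer's default token pattern: 2+ word characters
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(sentence: str) -> list:
    """Lowercase the text and keep tokens of two or more word characters."""
    return TOKEN_PATTERN.findall(sentence.lower())


def count_vectorize(sentences: list):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in sentence i."""
    tokenized = [tokenize(s) for s in sentences]
    terms = sorted({token for tokens in tokenized for token in tokens})
    vocabulary = {term: index for index, term in enumerate(terms)}
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in terms])
    return vocabulary, matrix


# Hypothetical two-sentence corpus
vocabulary, matrix = count_vectorize(["слова из списка words", "в списке нет слова"])
print(vocabulary)
print(matrix)
```

Each row of `matrix` corresponds to one sentence and each column to one vocabulary term, which is the layout of `sents_cv` above.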

## Lab 9

### Edit distance

1.1 Load the preprocessed recipe descriptions from the file `preprocessed_descriptions.csv`. Build the set of unique words `words` that occur in the description texts (use `word_tokenize` from `nltk`).

In [13]:
import pandas as pd
from nltk.tokenize import word_tokenize

In [14]:
preprocessed_descriptions = pd.read_csv("./data/preprocessed_descriptions.csv")
preprocessed_descriptions

Unnamed: 0.1,Unnamed: 0,name,preprocessed_descriptions
0,0,george s at the cove black bean soup,an original recipe created by chef scott meska...
1,1,healthy for them yogurt popsicles,my children and their friends ask for my homem...
2,2,i can t believe it s spinach,these were so go it surprised even me
3,3,italian gut busters,my sisterinlaw made these for us at a family g...
4,4,love is in the air beef fondue sauces,i think a fondue is a very romantic casual din...
...,...,...,...
29995,29995,zurie s holey rustic olive and cheddar bread,this is based on a french recipe but i changed...
29996,29996,zwetschgenkuchen bavarian plum cake,this is a traditional fresh plum cake thought ...
29997,29997,zwiebelkuchen southwest german onion cake,this is a traditional late summer early fall s...
29998,29998,zydeco soup,this is a delicious soup that i originally fou...


In [15]:
descriptions = preprocessed_descriptions["preprocessed_descriptions"]
# Skip missing (NaN) descriptions: tokenize only real strings
words = [word_tokenize(item) for item in descriptions if isinstance(item, str)]

# Flatten the per-recipe token lists into one list, then deduplicate
words_list = [token for tokens in words for token in tokens]
words_set = set(words_list)

# Print only the first few token lists; printing all of them exceeds the notebook's output rate limit
for item in words[:3]:
    print(item)

['an', 'original', 'recipe', 'created', 'by', 'chef', 'scott', 'meskan', 'georges', 'at', 'the', 'cove', 'we', 'enjoyed', 'this', 'when', 'we', 'visited', 'this', 'restaurant', 'in', 'la', 'jolla', 'california', 'this', 'recipe', 'is', 'requested', 'so', 'often', 'they', 'have', 'it', 'printed', 'and', 'ready', 'at', 'the', 'hostess', 'stand', 'its', 'unbeatable', 'at', 'the', 'restaurant', 'but', 'i', 'do', 'a', 'pretty', 'good', 'job', 'at', 'home', 'too', 'if', 'i', 'do', 'say', 'so', 'myself']
['my', 'children', 'and', 'their', 'friends', 'ask', 'for', 'my', 'homemade', 'popsicles', 'morning', 'noon', 'and', 'night', 'i', 'never', 'turn', 'them', 'down', 'who', 'am', 'i', 'to', 'tell', 'them', 'that', 'they', 'are', 'good', 'for', 'them', 'for', 'variety', 'i', 'substitute', 'different', 'flavours', 'of', 'frozen', 'juice', 'grape', 'fruit', 'punch', 'tropical', 'etc']
['these', 'were', 'so', 'go', 'it', 'surprised', 'even', 'me']

In [16]:
print(f"Total words: {len(words_list)}\nUnique words: {len(words_set)}")

Total words: 1069254
Unique words: 32868


1.2 Generate 5 pairs of randomly chosen words and compute the edit distance within each pair.

In [17]:
import random
data = random.sample(list(words_set), 10)
for i in range(0,len(data),2):
    x, y = data[i], data[i+1]
    print(edit_distance(x, y), x, y)


8 fruitcake familystyle
7 despair durcholz
5 roast much
7 cazuela occupies
9 skilletsthe kaysville


1.3 Write a function that, for a given word `word`, returns the `k` words closest to it from the list `words` (closeness is measured by Levenshtein distance).<br>The list `words` comes from 1.1.

In [18]:
# Still waiting for Jupyter to support Python 3.10 with native generic typing
from typing import List, Set, Tuple

In [19]:
def same_words(word: str, k: int, words_data: Set[str]) -> List[Tuple[int, str]]:
    """Return the k words from words_data closest to word, as (distance, word) pairs."""
    buf_tuple = [(edit_distance(word, item), item) for item in words_data]
    # Stable sort by distance only; ties keep their iteration order
    buf_tuple.sort(key=lambda x: x[0])
    return buf_tuple[:k]

In [20]:
same_words("seedless", 11, words_set)

[(0, 'seedless'),
 (1, 'needless'),
 (2, 'needles'),
 (2, 'endless'),
 (3, 'needle'),
 (3, 'seeds'),
 (3, 'sweetness'),
 (3, 'seeded'),
 (3, 'swedes'),
 (3, 'feeders'),
 (3, 'eggless')]

### Stemming and lemmatization

2.1 Using the results of 1.1, create a `pd.DataFrame` with the columns:
    * word
    * stemmed_word
    * normalized_word

Use the `word` column as the index.

For stemming use `SnowballStemmer`; for lemmatization use `WordNetLemmatizer`. Compare the results of stemming and lemmatization.

In [21]:
from nltk.stem import WordNetLemmatizer, SnowballStemmer

lemmatizer = WordNetLemmatizer()
stemmer = SnowballStemmer('english')

In [22]:
# Convert the set to a list first: pandas may reject unordered sets
words_df = pd.DataFrame({"word": list(words_set)})
words_df["stemmed_word"] = words_df["word"].apply(stemmer.stem)
# pos="v" makes WordNet treat every word as a verb
words_df["normalized_word"] = words_df["word"].apply(lambda w: lemmatizer.lemmatize(w, "v"))
# Keep the rows where the lemma differs from both the word and its stem
words_df[(words_df["word"] != words_df["normalized_word"]) & (words_df["stemmed_word"] != words_df["normalized_word"])]


Unnamed: 0,word,stemmed_word,normalized_word
61,titled,titl,title
153,found,found,find
163,stuffing,stuf,stuff
206,served,serv,serve
228,worrying,worri,worry
...,...,...,...
32612,stuck,stuck,stick
32692,agrees,agre,agree
32699,tasted,tast,taste
32728,plunging,plung,plunge
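
The rows above show the typical difference: the stemmer cuts suffixes by rule (`served` becomes `serv`) and leaves irregular forms alone (`found`, `stuck`), while the lemmatizer looks up a dictionary form (`serve`, `find`, `stick`). The toy sketch below imitates both behaviours. Its suffix list and lemma dictionary are made-up assumptions for illustration and are far simpler than what `SnowballStemmer` and `WordNetLemmatizer` actually do.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # toy rule set, not Snowball's rules
LEMMAS = {"found": "find", "stuck": "stick", "served": "serve"}  # toy dictionary


def toy_stem(word: str) -> str:
    """Strip the first matching suffix; the result may not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str) -> str:
    """Look the word up in a dictionary of base forms; fall back to the word itself."""
    return LEMMAS.get(word, word)


for word in ("served", "found", "stuck"):
    print(word, toy_stem(word), toy_lemmatize(word))
```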


2.2. Remove stop words from the recipe descriptions. What share of the total word count did the stop words make up? Compare the top 10 most frequent words before and after removing stop words.

In [23]:
#import nltk
#nltk.download('stopwords')

In [24]:
import nltk
from nltk.corpus import stopwords
stopwords_set = set(stopwords.words('english'))

In [25]:
words_filtered = [item for item in words_list if item not in stopwords_set]
# Share of stop words = removed words / all words
diff = round((1 - len(words_filtered) / len(words_list)) * 100, 2)
print(f"Total words: {len(words_list)}\nAfter removing stop words: {len(words_filtered)}\nShare of stop words: {diff}%")

Total words: 1069254
After removing stop words: 580889
Share of stop words: 45.67%


Top 10 words before removal

In [26]:
freq = nltk.FreqDist(words_list)
for word, number in freq.most_common(10):
    print(f"{number} -> {word}")

40072 -> the
34951 -> a
30245 -> and
26859 -> this
24836 -> i
23471 -> to
20285 -> is
19756 -> it
18364 -> of
15939 -> for


Top 10 words after removal

In [27]:
freq = nltk.FreqDist(words_filtered)
for word, number in freq.most_common(10):
    print(f"{number} -> {word}")

14871 -> recipe
6326 -> make
5137 -> time
4620 -> use
4430 -> great
4167 -> like
4152 -> easy
3872 -> one
3810 -> made
3791 -> good
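
The same bookkeeping can be sketched with only the standard library. The corpus and the stop-word set below are hypothetical and tiny; the point is that the stop-word share is the number of removed tokens divided by the number of all tokens, and that `collections.Counter` gives the frequency ranking.

```python
from collections import Counter

# Hypothetical miniature corpus and stop-word list, for illustration only
tokens = "this is a great recipe and this recipe is easy to make".split()
stop_words = {"this", "is", "a", "and", "to"}

kept = [token for token in tokens if token not in stop_words]
removed = len(tokens) - len(kept)

# Share of stop words = removed tokens / all tokens
stop_share = removed / len(tokens)
print(f"Stop-word share: {stop_share:.2%}")

# Most frequent tokens before and after filtering
print(Counter(tokens).most_common(3))
print(Counter(kept).most_common(3))
```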


### Vector representation of text

3.1 Randomly select 5 recipes from the dataset. Represent each recipe's description as a numeric vector using `TfidfVectorizer`.

In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [29]:
preprocessed_descriptions = pd.read_csv("./data/preprocessed_descriptions.csv")
data = preprocessed_descriptions.sample(5)
data

Unnamed: 0.1,Unnamed: 0,name,preprocessed_descriptions
24317,24317,simply delicious cookies,these are sooooo good recipe makes a lot but t...
5874,5874,chicken paprikash aka sour cream soup,a great hungarian meal my italian grandmother ...
12448,12448,grandpa long s blueberry cake,my mother recommended i post this recipe as sh...
9393,9393,dhal soup,dhal is a term traditionally used to describe ...
27161,27161,tex mex cornbread,this is an easy yet tasty cornbread i serve it...


## Via fit on the whole sample, then transform for each description

In [30]:
vectorizer = TfidfVectorizer(analyzer="word", stop_words="english")
vectorizer.fit(data["preprocessed_descriptions"])

def vectorizer_processing(x):
    sents = [x["preprocessed_descriptions"]]
    vector = vectorizer.transform(sents)
    return vector.toarray()

data['TfidfVectorizer'] = data.apply(lambda x: vectorizer_processing(x), axis=1)

In [31]:
# Vocabulary terms ordered by their column index
valuess = sorted(vectorizer.vocabulary_.items(), key=lambda item: item[1])
valuess

[('10', 0),
 ('asked', 1),
 ('childhood', 2),
 ('chili', 3),
 ('choice', 4),
 ('choys', 5),
 ('community', 6),
 ('cookbook', 7),
 ('cookie', 8),
 ('cookies', 9),
 ('cornbread', 10),
 ('dhal', 11),
 ('dish', 12),
 ('dozen', 13),
 ('easy', 14),
 ('enjoy', 15),
 ('enjoyed', 16),
 ('especially', 17),
 ('fiji', 18),
 ('freeze', 19),
 ('frosted', 20),
 ('frosting', 21),
 ('good', 22),
 ('got', 23),
 ('grandfather', 24),
 ('grandmother', 25),
 ('great', 26),
 ('holidays', 27),
 ('hungarian', 28),
 ('italian', 29),
 ('kitchen', 30),
 ('lasted', 31),
 ('legumes', 32),
 ('lentils', 33),
 ('long', 34),
 ('lot', 35),
 ('make', 36),
 ('makes', 37),
 ('making', 38),
 ('meal', 39),
 ('mother', 40),
 ('party', 41),
 ('plain', 42),
 ('polynesian', 43),
 ('post', 44),
 ('quickly', 45),
 ('recipe', 46),
 ('recommended', 47),
 ('remembered', 48),
 ('remembers', 49),
 ('said', 50),
 ('sam', 51),
 ('serve', 52),
 ('sooooo', 53),
 ('soup', 54),
 ('spicy', 55),
 ('swap', 56),
 ('tasty', 57),
 ('term', 58),
 (

In [32]:
for word, vector in zip(data["preprocessed_descriptions"].to_list(), data["TfidfVectorizer"].to_list()):
    print(vector.shape)
    print(f"{word}\n{vector}\n{'-'*10}\n")

(1, 65)
these are sooooo good recipe makes a lot but they go quickly the cookies freeze well but do not travel well i got the recipe from a community cookbook and have enjoyed making them for years especially around the holidays recipe makes about 10 dozen cookies so they are great for a cookie swap party
[[0.17275007 0.         0.         0.         0.         0.
  0.17275007 0.17275007 0.17275007 0.34550013 0.         0.
  0.         0.17275007 0.         0.         0.17275007 0.17275007
  0.         0.17275007 0.         0.         0.17275007 0.17275007
  0.         0.         0.13937367 0.17275007 0.         0.
  0.         0.         0.         0.         0.         0.17275007
  0.         0.34550013 0.17275007 0.         0.         0.17275007
  0.         0.         0.         0.17275007 0.418121   0.
  0.         0.         0.         0.         0.         0.17275007
  0.         0.         0.17275007 0.         0.         0.
  0.17275007 0.         0.         0.         0.17275

## Via fit_transform on the whole sample

In [33]:
vectorizer2 = TfidfVectorizer(analyzer="word", stop_words="english")
transform2 = vectorizer2.fit_transform(data["preprocessed_descriptions"].to_list())
transform2.shape

(5, 65)

In [34]:
# Vocabulary terms ordered by their column index
valuess = sorted(vectorizer2.vocabulary_.items(), key=lambda item: item[1])
valuess

[('10', 0),
 ('asked', 1),
 ('childhood', 2),
 ('chili', 3),
 ('choice', 4),
 ('choys', 5),
 ('community', 6),
 ('cookbook', 7),
 ('cookie', 8),
 ('cookies', 9),
 ('cornbread', 10),
 ('dhal', 11),
 ('dish', 12),
 ('dozen', 13),
 ('easy', 14),
 ('enjoy', 15),
 ('enjoyed', 16),
 ('especially', 17),
 ('fiji', 18),
 ('freeze', 19),
 ('frosted', 20),
 ('frosting', 21),
 ('good', 22),
 ('got', 23),
 ('grandfather', 24),
 ('grandmother', 25),
 ('great', 26),
 ('holidays', 27),
 ('hungarian', 28),
 ('italian', 29),
 ('kitchen', 30),
 ('lasted', 31),
 ('legumes', 32),
 ('lentils', 33),
 ('long', 34),
 ('lot', 35),
 ('make', 36),
 ('makes', 37),
 ('making', 38),
 ('meal', 39),
 ('mother', 40),
 ('party', 41),
 ('plain', 42),
 ('polynesian', 43),
 ('post', 44),
 ('quickly', 45),
 ('recipe', 46),
 ('recommended', 47),
 ('remembered', 48),
 ('remembers', 49),
 ('said', 50),
 ('sam', 51),
 ('serve', 52),
 ('sooooo', 53),
 ('soup', 54),
 ('spicy', 55),
 ('swap', 56),
 ('tasty', 57),
 ('term', 58),
 (

In [35]:
buffer_list = []

In [36]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = vectorizer2.get_feature_names_out()
for text, narray in zip(data["preprocessed_descriptions"].to_list(), transform2.toarray()):
    print(narray)
    print(f"\n*{text}*\n")
    for index, koeff in enumerate(narray):
        print(f"{index}. {feature_names[index]} -> {koeff}")

    print("------------\n")

[0.17275007 0.         0.         0.         0.         0.
 0.17275007 0.17275007 0.17275007 0.34550013 0.         0.
 0.         0.17275007 0.         0.         0.17275007 0.17275007
 0.         0.17275007 0.         0.         0.17275007 0.17275007
 0.         0.         0.13937367 0.17275007 0.         0.
 0.         0.         0.         0.         0.         0.17275007
 0.         0.34550013 0.17275007 0.         0.         0.17275007
 0.         0.         0.         0.17275007 0.418121   0.
 0.         0.         0.         0.         0.         0.17275007
 0.         0.         0.17275007 0.         0.         0.
 0.17275007 0.         0.         0.         0.17275007]

*these are sooooo good recipe makes a lot but they go quickly the cookies freeze well but do not travel well i got the recipe from a community cookbook and have enjoyed making them for years especially around the holidays recipe makes about 10 dozen cookies so they are great for a cookie swap party*

0. 10 -> 0



In [37]:
df = pd.DataFrame(transform2.toarray(), columns=vectorizer2.get_feature_names_out())
df

Unnamed: 0,10,asked,childhood,chili,choice,choys,community,cookbook,cookie,cookies,...,spicy,swap,tasty,term,traditionally,travel,use,used,winter,years
0,0.17275,0.0,0.0,0.0,0.0,0.0,0.17275,0.17275,0.17275,0.3455,...,0.0,0.17275,0.0,0.0,0.0,0.17275,0.0,0.0,0.0,0.17275
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.239987,0.0,0.0,0.0
2,0.0,0.233751,0.233751,0.0,0.233751,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.188589,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.27735,0.0,0.0,0.0,0.0,...,0.27735,0.0,0.0,0.27735,0.27735,0.0,0.0,0.27735,0.0,0.0
4,0.0,0.0,0.0,0.377964,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.377964,0.0,0.0,0.0,0.0,0.0,0.377964,0.0
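
The weights in this table can be reproduced by hand. The sketch below assumes scikit-learn's default `TfidfVectorizer` settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row); the function name `tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(documents: list) -> tuple:
    """Return (terms, rows) with L2-normalized TF-IDF weights per document."""
    n_docs = len(documents)
    terms = sorted({token for doc in documents for token in doc})
    # Document frequency: in how many documents each term occurs
    df = Counter(token for doc in documents for token in set(doc))
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in terms}

    rows = []
    for doc in documents:
        counts = Counter(doc)
        raw = [counts[term] * idf[term] for term in terms]
        norm = math.sqrt(sum(value * value for value in raw))
        rows.append([value / norm if norm else 0.0 for value in raw])
    return terms, rows


# Hypothetical pre-tokenized documents
docs = [["easy", "recipe", "recipe"], ["easy", "soup"]]
terms, rows = tfidf(docs)
for row in rows:
    print([round(value, 3) for value in row])
```

Under these assumptions the numbers in the first row are consistent: `recipe` occurs three times in the first description and in two of the five descriptions, so its weight relative to a word that occurs once in a single description is 3 · (ln 2 + 1) / (ln 3 + 1) ≈ 2.42, which matches 0.418121 / 0.17275007.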


3.2 Compute the similarity between every pair of recipes selected in task 3.1 using the cosine distance (`scipy.spatial.distance.cosine`). Present the results as a `pd.DataFrame` table. Use the recipe names as row and column labels.

## Via matrix multiplication (NumPy)

In [38]:
data

Unnamed: 0.1,Unnamed: 0,name,preprocessed_descriptions,TfidfVectorizer
24317,24317,simply delicious cookies,these are sooooo good recipe makes a lot but t...,"[[0.17275006689808725, 0.0, 0.0, 0.0, 0.0, 0.0..."
5874,5874,chicken paprikash aka sour cream soup,a great hungarian meal my italian grandmother ...,"[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,..."
12448,12448,grandpa long s blueberry cake,my mother recommended i post this recipe as sh...,"[[0.0, 0.2337505900744905, 0.2337505900744905,..."
9393,9393,dhal soup,dhal is a term traditionally used to describe ...,"[[0.0, 0.0, 0.0, 0.0, 0.0, 0.2773500981126146,..."
27161,27161,tex mex cornbread,this is an easy yet tasty cornbread i serve it...,"[[0.0, 0.0, 0.0, 0.3779644730092272, 0.0, 0.0,..."


In [39]:
# Refit on the same five descriptions; this reproduces the transform2 matrix computed earlier
transform2 = vectorizer2.fit_transform(data["preprocessed_descriptions"].to_list())

In [40]:
# TF-IDF rows are L2-normalized by default, so pairwise dot products are cosine similarities
final1 = (transform2 @ transform2.T).toarray()
final1

array([[1.        , 0.0334479 , 0.07885281, 0.        , 0.        ],
       [0.0334479 , 1.        , 0.04525883, 0.        , 0.        ],
       [0.07885281, 0.04525883, 1.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 1.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 1.        ]])

In [41]:
df_final1 = pd.DataFrame(final1)
# Label rows and columns with the recipe names, as the task requires
df_final1.columns = data["name"].to_list()
df_final1.index = data["name"].to_list()
df_final1

Unnamed: 0,simply delicious cookies,chicken paprikash aka sour cream soup,grandpa long s blueberry cake,dhal soup,tex mex cornbread
simply delicious cookies,1.0,0.033448,0.078853,0.0,0.0
chicken paprikash aka sour cream soup,0.033448,1.0,0.045259,0.0,0.0
grandpa long s blueberry cake,0.078853,0.045259,1.0,0.0,0.0
dhal soup,0.0,0.0,0.0,1.0,0.0
tex mex cornbread,0.0,0.0,0.0,0.0,1.0
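
The matrix product of the TF-IDF matrix with its transpose gives cosine similarities only because each row has unit length. A minimal standard-library sketch of that relationship follows; the helper names and the example vectors are hypothetical, and `cosine_distance` mirrors the definition used by `scipy.spatial.distance.cosine` (one minus the similarity).

```python
import math


def cosine_similarity(u: list, v: list) -> float:
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u: list, v: list) -> float:
    """Cosine distance: 1 - cosine similarity."""
    return 1.0 - cosine_similarity(u, v)


# Hypothetical unit-length vectors (each has Euclidean norm 1)
u = [0.6, 0.8, 0.0]
v = [0.0, 0.6, 0.8]
dot = sum(a * b for a, b in zip(u, v))
print(dot, cosine_similarity(u, v), cosine_distance(u, v))
```

For unit-length vectors the plain dot product and the cosine similarity coincide, so both approaches in this section should produce the same table.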


## Via `scipy.spatial.distance.cosine`

In [42]:
import scipy.spatial.distance  # a bare "import scipy" does not guarantee the submodule is loaded
import numpy as np
import itertools

In [43]:
max_pair = None
max_result = -1

In [44]:
coeff_dict = {}
vectorizer3 = TfidfVectorizer(analyzer="word", stop_words="english")
transform3 = vectorizer3.fit_transform(data["preprocessed_descriptions"].to_list())

all_data = list(zip(data["preprocessed_descriptions"].to_list(), transform3.toarray()))

for pair in itertools.product(all_data, repeat=2):
    
    text1, matrix1 = pair[0]
    text2, matrix2 = pair[1]
    result = scipy.spatial.distance.cosine(matrix1, matrix2)
    inverse_result = 1-result
    
    if text1 not in coeff_dict:
        coeff_dict[text1] = []
    coeff_dict[text1].append(inverse_result)
    

    if inverse_result > max_result and text1 != text2:
        max_result = inverse_result
        max_pair = (text1, text2)
    
    print(f"{text1}\n{text2}\n{inverse_result}\n")

these are sooooo good recipe makes a lot but they go quickly the cookies freeze well but do not travel well i got the recipe from a community cookbook and have enjoyed making them for years especially around the holidays recipe makes about 10 dozen cookies so they are great for a cookie swap party
these are sooooo good recipe makes a lot but they go quickly the cookies freeze well but do not travel well i got the recipe from a community cookbook and have enjoyed making them for years especially around the holidays recipe makes about 10 dozen cookies so they are great for a cookie swap party
1

these are sooooo good recipe makes a lot but they go quickly the cookies freeze well but do not travel well i got the recipe from a community cookbook and have enjoyed making them for years especially around the holidays recipe makes about 10 dozen cookies so they are great for a cookie swap party
a great hungarian meal my italian grandmother use to make this for my hungarian grandfather this is 

In [45]:
df_final2 = pd.DataFrame.from_dict(coeff_dict)
# Label rows and columns with the recipe names, as the task requires
df_final2.columns = data["name"].to_list()
df_final2.index = data["name"].to_list()
df_final2

Unnamed: 0,simply delicious cookies,chicken paprikash aka sour cream soup,grandpa long s blueberry cake,dhal soup,tex mex cornbread
simply delicious cookies,1.0,0.033448,0.078853,0.0,0.0
chicken paprikash aka sour cream soup,0.033448,1.0,0.045259,0.0,0.0
grandpa long s blueberry cake,0.078853,0.045259,1.0,0.0,0.0
dhal soup,0.0,0.0,0.0,1.0,0.0
tex mex cornbread,0.0,0.0,0.0,0.0,1.0


3.3 Which recipes are the most similar? Comment on the result (in words).

The most similar recipes are the pair whose similarity coefficient is closest to one, excluding the diagonal, where each recipe is compared with itself.

In [47]:
print(f"The most similar pair of descriptions in the sample above:\n\n{max_pair[0]}\n\n{max_pair[1]}\n\n{max_result}")

The most similar pair of descriptions in the sample above:

these are sooooo good recipe makes a lot but they go quickly the cookies freeze well but do not travel well i got the recipe from a community cookbook and have enjoyed making them for years especially around the holidays recipe makes about 10 dozen cookies so they are great for a cookie swap party

my mother recommended i post this recipe as she remembers it from her childhood  i asked her if it had a frosting to go with it and she said it never lasted long enough to be frosted  so enjoy it plain or use your choice of frosting

0.07885281375370234


In [48]:
set(max_pair[0].split(" ")) & set(max_pair[1].split(" "))

{'a', 'and', 'from', 'go', 'i', 'recipe', 'so'}
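
Most of the shared tokens above are stop words, which `TfidfVectorizer(stop_words="english")` discards before weighting. The short sketch below filters them out; the stop-word set is hand-picked for this example and is an assumption, since the vectorizer's built-in English list is much longer.

```python
# Shared tokens reported by the cell above
shared = {"a", "and", "from", "go", "i", "recipe", "so"}

# Hand-picked stop words for this example (an assumption, not the full built-in list)
stop_words = {"a", "and", "from", "go", "i", "so"}

informative = shared - stop_words
print(informative)  # {'recipe'}
```

Under this assumption only `recipe` is left, which fits the low similarity score of about 0.079: a single shared term carries the whole overlap between the two descriptions.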