# Task: comparing sentences

You are given a set of sentences copied from Wikipedia. Each of them has a "cat theme" in one of three senses:

- cats (the animals)
- the UNIX utility cat, which prints the contents of files
- versions of the OS X operating system named after members of the cat family

Your task is to find the two sentences closest in meaning to the one on the very first line. Cosine distance is used as the measure of closeness in meaning.
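
For reference, cosine distance between two vectors is one minus the cosine of the angle between them. The snippet below is a minimal pure-Python sketch of that definition; the helper name `cosine_distance` and the toy vectors are illustrative assumptions, and the solution itself relies on `scipy.spatial.distance.cosine` instead.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Vectors pointing the same way are at distance ~0,
# orthogonal vectors (no shared words) are at distance 1.
same_direction = cosine_distance([1, 2, 0], [2, 4, 0])
orthogonal = cosine_distance([1, 0, 0], [0, 1, 0])
```

Because only the angle matters, a long sentence and a short one that use the same words in the same proportions are treated as identical.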

## Solution 

In [1]:
import re
import numpy as np
from scipy.spatial.distance import cosine

In [2]:
# Reading file.
sentences = []
with open("data/sentences.txt", "r") as file_obj:
    for line in file_obj:
        sentences.append(line.strip().lower())

In [3]:
# Tokenization.
sentences_tokenized = list(map(lambda sentence: re.split('[^a-z]', sentence), sentences))

In [4]:
# Removing the empty tokens left by the split.
sentences_tokenized = list(map(lambda sentence: list(filter(lambda word: word != "", sentence)), sentences_tokenized))
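
To see what the two tokenization steps produce, here is a small sketch that applies the same regular expression to a couple of hand-written strings. The `tokenize` helper and the sample inputs are assumptions made for illustration only.

```python
import re


def tokenize(sentence):
    """Lowercase, split on every non-letter character, drop empty tokens."""
    return [token for token in re.split("[^a-z]", sentence.lower()) if token]


tokens = tokenize("In comparison to dogs, cats have not undergone major changes.")
# Every non-letter is a separator, so apostrophes and digits also split words.
tokens_with_noise = tokenize("apple's mac, 2012")
```

One consequence of this pattern is that numbers disappear entirely and possessives such as "apple's" contribute a stray "s" token.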

In [5]:
# Building a dictionary that maps an index to each unique word.
all_words = []
for sentence in sentences_tokenized:
    for word in sentence:
        all_words.append(word)

unique_words = set(all_words)
keys = range(len(unique_words))
dictionary = dict(zip(keys, unique_words))

In [6]:
# Matrix n*d, where n is the number of
# sentences and d is the number of unique words.
# a(i, j) contains the number of occurrences
# of word j in sentence i.
matrix = np.array([[sentence.count(word) for word in dictionary.values()] for sentence in sentences_tokenized])
print(matrix.shape)

(22, 254)
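
The same word-count matrix can be sketched on a toy corpus, which makes the row and column layout easier to check by hand. This version uses `collections.Counter` and a sorted vocabulary so that the column order is reproducible; the two tiny documents and all variable names are hypothetical, not taken from the data file.

```python
from collections import Counter

# Two already-tokenized toy sentences (illustrative only).
docs = [["cats", "chase", "mice"], ["cat", "prints", "files", "files"]]

# Sorting fixes the column order, which a plain set does not guarantee.
vocabulary = sorted({word for doc in docs for word in doc})
word_to_column = {word: j for j, word in enumerate(vocabulary)}

counts_matrix = []
for doc in docs:
    row = [0] * len(vocabulary)
    for word, count in Counter(doc).items():
        row[word_to_column[word]] = count
    counts_matrix.append(row)
```

Each row has one entry per vocabulary word, so the result has as many rows as sentences and as many columns as unique words.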


In [7]:
# Computing the cosine distance between the 1st
# sentence and every sentence (including itself).
index_of_sentence_to_compare_with = 0
distances = []
fst_sentence = matrix[index_of_sentence_to_compare_with]
for line in matrix:
    distance = cosine(fst_sentence, line)
    distances.append(distance)

In [8]:
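# Selecting the 2 indexes with the smallest distances.
# Position 0 of the ordering is the first sentence itself (distance 0);
# sorting indexes instead of values keeps tied distances distinct.
nearest_order = np.argsort(distances, kind="stable")
fst_min, snd_min = nearest_order[1], nearest_order[2]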

In [9]:
# Selecting the 2 indexes with the largest distances.
farthest_order = np.argsort(distances, kind="stable")
fst_max, snd_max = farthest_order[-1], farthest_order[-2]
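
Picking the nearest and farthest sentences amounts to ranking indexes by distance. The sketch below shows the idea in plain Python on made-up distances (the values in `toy_distances` are invented for illustration), including a tie between two entries.

```python
# Distance of each toy sentence to sentence 0; index 0 is the sentence itself.
toy_distances = [0.0, 0.8, 0.3, 0.8, 0.5]

# Sort the indexes by their distance; Python's sort is stable,
# so tied distances keep their original relative order.
ranked = sorted(range(len(toy_distances)), key=lambda i: toy_distances[i])

nearest_two = ranked[1:3]        # skip position 0, the sentence itself
farthest_two = ranked[-1:-3:-1]  # last two positions, farthest first
```

Ranking indexes rather than looking values up afterwards means that two sentences at the same distance are still reported as two different sentences.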

In [10]:
# Writing 2 indexes with minimum values to file.
with open('output/submission-1.txt', 'w') as file_obj:
    file_obj.write("{} {}".format(fst_min, snd_min))

## Analyzing the solution

The given sentence was:

In [11]:
print(sentences[index_of_sentence_to_compare_with])

in comparison to dogs, cats have not undergone major changes during the domestication process.


The two sentences nearest to it are:

In [12]:
print(sentences[fst_min])
print(sentences[snd_min])

domestic cats are similar in size to the other members of the genus felis, typically weighing between 4 and 5 kg (8.8 and 11.0 lb).
in one, people deliberately tamed cats in a process of artificial selection, as they were useful predators of vermin.


The two farthest sentences are:

In [13]:
print(sentences[fst_max])
print(sentences[snd_max])

os x mountain lion was released on july 25, 2012 for purchase and download through apple's mac app store, as part of a switch to releasing os x versions online and every year.
as cat simply catenates streams of bytes, it can be also used to concatenate binary files, where it will just concatenate sequence of bytes.


So the nearest sentences are about cats as animals, while the farthest ones are about OS X and the UNIX cat utility.