# Sentence comparison

We are given a set of sentences from Wikipedia. Each of them relates to cats in one of three ways:
- cats (animals)
- Unix utility cat
- versions of the OS X operating system, named after members of the cat family

**The task** is to find the two sentences that are closest to the first sentence, measured by cosine distance.

In [120]:
import numpy as np
import re
from functools import reduce
from scipy.spatial import distance

First, convert all lines to lowercase and split them into tokens, treating every character other than a-z as a separator.

In [121]:
tokenized_lines = []
with open('sentences.txt') as file:
    lines = file.readlines()
    for line in lines:
        unfiltered_words = re.split('[^a-z]', line.lower()) 
        tokenized_lines.append(list(filter(lambda x: x != '', unfiltered_words)))
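
As a small illustration of this tokenization rule, the sketch below applies the same regular expression to a single string. The helper name `tokenize` and the sample sentence are hypothetical and are not taken from `sentences.txt`.

```python
import re

def tokenize(line):
    """Lowercase a line and split it on every character that is not a-z."""
    # re.split yields empty strings between adjacent separators, so drop them.
    return [word for word in re.split('[^a-z]', line.lower()) if word]

# Hypothetical sample sentence (an assumption for illustration only).
sample = "Mac OS X 10.7 Lion, released in July 2011."
tokens = tokenize(sample)
print(tokens)
```

Note that digits and punctuation act as separators under this rule, so version numbers such as `10.7` disappear from the token list entirely.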

In [122]:
tokenized_lines

[['in',
  'comparison',
  'to',
  'dogs',
  'cats',
  'have',
  'not',
  'undergone',
  'major',
  'changes',
  'during',
  'the',
  'domestication',
  'process'],
 ['as',
  'cat',
  'simply',
  'catenates',
  'streams',
  'of',
  'bytes',
  'it',
  'can',
  'be',
  'also',
  'used',
  'to',
  'concatenate',
  'binary',
  'files',
  'where',
  'it',
  'will',
  'just',
  'concatenate',
  'sequence',
  'of',
  'bytes'],
 ['a',
  'common',
  'interactive',
  'use',
  'of',
  'cat',
  'for',
  'a',
  'single',
  'file',
  'is',
  'to',
  'output',
  'the',
  'content',
  'of',
  'a',
  'file',
  'to',
  'standard',
  'output'],
 ['cats',
  'can',
  'hear',
  'sounds',
  'too',
  'faint',
  'or',
  'too',
  'high',
  'in',
  'frequency',
  'for',
  'human',
  'ears',
  'such',
  'as',
  'those',
  'made',
  'by',
  'mice',
  'and',
  'other',
  'small',
  'animals'],
 ['in',
  'one',
  'people',
  'deliberately',
  'tamed',
  'cats',
  'in',
  'a',
  'process',
  'of',
  'artificial',
  's

Then get all the unique words from the sentences.

In [123]:
uniq_words = list(set(sum(tokenized_lines, [])))
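
A possible alternative, sketched here on toy data, is to flatten the token lists with `itertools.chain` and sort the vocabulary so that the column order is reproducible between runs (a plain `set` has no guaranteed order). The names `toy_lines`, `vocab`, and `word_index` are assumptions introduced for this example.

```python
from itertools import chain

# Toy token lists standing in for tokenized_lines (hypothetical data).
toy_lines = [
    ['cats', 'can', 'hear'],
    ['cat', 'can', 'concatenate', 'files'],
]

# Flatten all sentences, deduplicate, and sort for a stable column order.
vocab = sorted(set(chain.from_iterable(toy_lines)))

# Map each word to its column index for fast lookups later.
word_index = {word: i for i, word in enumerate(vocab)}

print(vocab)
```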

In [50]:
uniq_words

['vermin',
 'however',
 'yosemite',
 'off',
 'content',
 'community',
 'in',
 'single',
 'are',
 'wild',
 'version',
 'felis',
 'computers',
 'people',
 'frequency',
 'offered',
 'installs',
 'made',
 'size',
 'year',
 'was',
 'for',
 'started',
 'features',
 'high',
 'drive',
 'use',
 'entirely',
 'kg',
 'purchase',
 'redirected',
 'needing',
 'domestication',
 'has',
 'release',
 'symbols',
 'were',
 'right',
 'interactive',
 'their',
 'releases',
 'osx',
 'sequence',
 'october',
 'command',
 'factory',
 'safari',
 'safer',
 'process',
 'unix',
 'can',
 'installed',
 'longer',
 'both',
 'upgrade',
 'redirection',
 'keyboards',
 'developed',
 's',
 'allows',
 'later',
 'released',
 'output',
 'cats',
 'they',
 'world',
 'information',
 'count',
 'file',
 'online',
 'where',
 'patch',
 'than',
 'available',
 'arguments',
 'no',
 'x',
 'disk',
 'error',
 'any',
 'july',
 'time',
 'update',
 'an',
 'small',
 'according',
 'or',
 'os',
 'enhancements',
 'lb',
 'commands',
 'basic',
 'pipe

Next, build a matrix in which the element at index (i, j) equals the number of occurrences of the j-th word in the i-th sentence.

In [53]:
matrix = np.zeros((len(tokenized_lines), len(uniq_words)))

In [64]:
for i, line in enumerate(tokenized_lines):
    for j, word in enumerate(uniq_words):
        # matrix[i][j] is the number of occurrences of the j-th word in the i-th sentence.
        matrix[i][j] = line.count(word)
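
To make the layout of the count matrix concrete, here is a minimal sketch on toy data that uses `collections.Counter` and plain nested lists instead of NumPy. The names `toy_lines`, `vocab`, and `toy_matrix` are hypothetical.

```python
from collections import Counter

# Toy token lists standing in for tokenized_lines (hypothetical data).
toy_lines = [
    ['cats', 'can', 'hear'],
    ['cat', 'can', 'can', 'read', 'files'],
]

# Sorted vocabulary: one column per unique word.
vocab = sorted({word for line in toy_lines for word in line})

# Count each sentence once, then read the counts off column by column.
# A Counter returns 0 for words that do not occur in the sentence.
counts = [Counter(line) for line in toy_lines]
toy_matrix = [[count[word] for word in vocab] for count in counts]

print(vocab)
for row in toy_matrix:
    print(row)
```

Each row corresponds to a sentence and each column to a word, so the word `can`, which occurs twice in the second toy sentence, produces a 2 in that row.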

Finally, compute the cosine distance between the first row of the matrix and every row, and select the two smallest distances among the other sentences.

In [124]:
dist = [distance.cosine(matrix[0], row) for row in matrix]
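
For reference, the cosine distance between vectors u and v is 1 - (u · v) / (‖u‖ ‖v‖). The sketch below spells that formula out in plain Python; the function name `cosine_distance` and the toy vectors are assumptions for illustration, and it is not meant to replace `scipy.spatial.distance.cosine`.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v) for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Toy count vectors (hypothetical data).
row_a = [1, 0, 1, 0, 1, 0]
row_b = [2, 1, 0, 1, 0, 1]

d_ab = cosine_distance(row_a, row_b)
print(d_ab)
```

A distance close to 0 means the two sentences use words in nearly the same proportions, and a distance of 1 means they share no words at all.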

In [126]:
sorted_dist = sorted(dist[1:])

In [127]:
sorted_dist

[0.7327387580875756,
 0.7770887149698589,
 0.8250364469440588,
 0.8328165362273942,
 0.8396432548525454,
 0.8406361854220809,
 0.8427572744917122,
 0.8644738145642124,
 0.8703592552895671,
 0.8740118423302576,
 0.8804771390665607,
 0.8842724875284311,
 0.8885443574849294,
 0.8951715163278082,
 0.9055088817476932,
 0.9258750683338899,
 0.9402385695332803,
 0.9442721787424647,
 0.9442721787424647,
 0.9527544408738466,
 0.956644501523794]

In [128]:
res = str(dist.index(sorted_dist[0])) + ' ' + str(dist.index(sorted_dist[1]))
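
One caveat with looking indices up through `list.index`: if two sentences have exactly the same distance, the same index is returned for both. A tie-safe variant, sketched here on hypothetical distances, sorts the indices themselves. The names `toy_dist`, `nearest`, and `naive` are assumptions for this example.

```python
# Hypothetical distances; index 0 is the first sentence compared with itself.
toy_dist = [0.0, 0.9, 0.7, 0.9, 0.7]

# Sort the candidate indices (skipping index 0) by their distance.
# sorted() is stable, so tied distances keep their original index order.
nearest = sorted(range(1, len(toy_dist)), key=lambda i: toy_dist[i])[:2]

# The value-based lookup returns the first matching index both times.
naive = [toy_dist.index(d) for d in sorted(toy_dist[1:])[:2]]

print(nearest, naive)
```

When the two smallest distances differ, both approaches select the same pair of sentences.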

In [129]:
with open('submission-1.txt', 'w') as out:
    out.write(res)

In [130]:
!cat submission-1.txt

6 4