## List 3

**21**

For a given bitstring **b**, list all bitstrings **b’** such that the Hamming distance between `b` and `b’` is equal to 1.

In [1]:
def bitstring_hamming_distance(b):
    n = len(bin(b)) - 2
    pattern = 2**(n-1)
    while pattern > 0:
        print(format(b ^ pattern, f"0{n}b"))
        pattern = pattern // 2

In [2]:
bitstring = 0b10101
bitstring_hamming_distance(bitstring)

00101
11101
10001
10111
10100


**22**

Construct a function that returns the Jaccard similarity of two sets. Note that the function must check that at least one of the sets is nonempty, since otherwise the union is empty and the ratio is undefined.

In [3]:
def jaccard_sim(s1, s2):
    # With two empty sets the union is empty and the ratio would be 0/0.
    assert s1 or s2, "At least one of the sets must be nonempty"
    return len(set(s1) & set(s2)) / len(set(s1) | set(s2))

In [4]:
jaccard_sim({2,5}, {5,2})

1.0

**23**

Construct a function that computes Jaccard similarity for two strings treated as bags of words.

In [5]:
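def jacc2(s1, s2):
    # Lower-case both strings and split them into words (bags, so duplicates count).
    words1 = [word.lower() for word in s1.split(" ")]
    words2 = [word.lower() for word in s2.split(" ")]
    freq1 = {w: words1.count(w) for w in set(words1)}
    freq2 = {w: words2.count(w) for w in set(words2)}
    # Bag intersection: each shared word counts with its smaller multiplicity.
    common = set(words1) & set(words2)
    inter = sum(min(freq1[w], freq2[w]) for w in common)

    # Debug output: intersection size with the shared words, then the set union.
    print(inter, common)
    print(set(words1) | set(words2))
    # The bag union is the sum of both bag sizes, so the result never exceeds 0.5.
    return inter / (len(words1) + len(words2))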

In [6]:
jacc2("Witam Pana Michała.", "Witam PaNa witolda i Pana Michała.")

3 {'pana', 'witam', 'michała.'}
{'michała.', 'pana', 'i', 'witolda', 'witam'}


0.3333333333333333

**24**

(use NLTK) List all words in `text1` with edit distance from the word `dog` smaller than 4. Hint: you can safely reject all long words without computations (why?)

In [3]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [2]:
from nltk.book import *
from nltk.corpus import stopwords

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [6]:
stop_words = set(stopwords.words('english'))

# Tokens of length 7 or more differ from "dog" by at least 4 edits, so skip them.
short_tokens = [token for token in text1 if len(token) < 7]

# Count matching tokens (duplicates included): first without stop words, then with them.
print(len([t for t in short_tokens
           if t not in stop_words and nltk.edit_distance(t, "dog") < 4]))
print(len([t for t in short_tokens if nltk.edit_distance(t, "dog") < 4]))

69100
148344
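
One way to answer the "why?" in the hint: an insertion or a deletion changes the length of a word by one and a substitution leaves it unchanged, so the edit distance between two words is at least the difference of their lengths. Since `dog` has 3 letters, every word with 7 or more letters is at distance 4 or more. The sketch below illustrates this with a hand-written `levenshtein` helper (a stand-in used here instead of `nltk.edit_distance`, so the example has no dependencies) and a made-up word list; `close_words` and `sample` are illustrative names, not part of the exercise.

```python
def levenshtein(a, b):
    """Plain dynamic-programming edit distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution or match
        prev = curr
    return prev[-1]


def close_words(words, target="dog", max_dist=4):
    """Distinct words whose edit distance to target is smaller than max_dist."""
    result = set()
    for w in set(words):
        if abs(len(w) - len(target)) >= max_dist:
            continue  # the length gap alone already forces distance >= max_dist
        if levenshtein(w, target) < max_dist:
            result.add(w)
    return result


# Hypothetical sample; with NLTK loaded one could pass text1 instead.
sample = ["dog", "dig", "whale", "leviathan", "harpooneer", "do"]
nearby = close_words(sample)
```

Unlike the cell above, this returns the distinct words rather than a token count, which is closer to what the exercise asks for.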


**25**

(use NLTK) Let `text1-text9` be bags of words. Compute similarity between all pairs of texts.

In [7]:
from itertools import combinations

texts = [text1, text2, text3, text4, text5, text6, text7, text8, text9]
for t1, t2 in combinations(texts, 2):
    words1 = {w for w in t1 if w not in stop_words}
    words2 = {w for w in t2 if w not in stop_words}
    # nltk.jaccard_distance returns 1 - Jaccard similarity, so lower means more similar.
    print(f"{t1.name} - {t2.name}: {nltk.jaccard_distance(words1, words2)}")

Moby Dick by Herman Melville 1851 - Sense and Sensibility by Jane Austen 1811: 0.7873816443236149
Moby Dick by Herman Melville 1851 - The Book of Genesis: 0.918655312407169
Moby Dick by Herman Melville 1851 - Inaugural Address Corpus: 0.7596932621058073
Moby Dick by Herman Melville 1851 - Chat Corpus: 0.9021990993748087
Moby Dick by Herman Melville 1851 - Monty Python and the Holy Grail: 0.9335110395815521
Moby Dick by Herman Melville 1851 - Wall Street Journal: 0.8320154434421057
Moby Dick by Herman Melville 1851 - Personals Corpus: 0.9724919916611583
Moby Dick by Herman Melville 1851 - The Man Who Was Thursday by G . K . Chesterton 1908: 0.782489869003864
Sense and Sensibility by Jane Austen 1811 - The Book of Genesis: 0.8690246257846451
Sense and Sensibility by Jane Austen 1811 - Inaugural Address Corpus: 0.7114655716993051
Sense and Sensibility by Jane Austen 1811 - Chat Corpus: 0.8689815643458028
Sense and Sensibility by Jane Austen 1811 - Monty Python and the Holy Grail: 0.883390

**26**

(use NLTK) Let us consider a metric space $(S, d)$, where $S$ is the set of words from `text1` and $d$ is the Hamming distance. Find the diameter of $(S, d)$.

A metric space $M$ is called bounded if there exists some number $r$ such that $d(x,y) \leq r$ for all $x$ and $y$ in $M$. The smallest such $r$ is called the diameter of $M$.

In [11]:
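from collections import defaultdict
from itertools import combinations

def find_diameter(text):
    # Group distinct lower-cased words by length; Hamming distance needs equal lengths.
    by_length = defaultdict(list)
    for word in {t.lower() for t in text}:
        by_length[len(word)].append(word)

    # Process the longest words first.
    groups = sorted(by_length.items(), reverse=True)

    best = (0, "", "")
    for i, (_, words) in enumerate(groups):
        for s1, s2 in combinations(words, 2):
            dist = sum(ex != ey for ex, ey in zip(s1, s2))
            if dist > best[0]:
                best = (dist, s1, s2)
        # The distance is bounded by the word length, so shorter groups cannot beat best.
        if i + 1 < len(groups) and best[0] >= groups[i + 1][0]:
            break

    return best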

In [12]:
list(map(find_diameter, [text1, text2, text3, text4, text5]))

[(17, 'preternaturalness', 'cannibalistically'),
 (16, 'disqualifications', 'companionableness'),
 (14, 'interpretations', 'zaphnathpaaneah'),
 (17, 'antiphilosophists', 'misrepresentation'),
 (31, 'wooooooooooooohoooooooooooooooo', ')))))))))))))))))))))))))))))))')]

**27**

(use NLTK) Construct a dictionary that assigns each pair of consecutive words in `text1` the Jaccard similarity between them.

In [13]:
# Jaccard similarity = 1 - Jaccard distance; each word is treated as a set of characters.
zad27 = {(t1, t2): 1 - nltk.jaccard_distance(set(t1), set(t2)) for t1, t2 in zip(text1, text1[1:])}
list(zad27)[:5]

[('[', 'Moby'),
 ('Moby', 'Dick'),
 ('Dick', 'by'),
 ('by', 'Herman'),
 ('Herman', 'Melville')]

**28**

(use NLTK) For two words $v$ and $w$, let the *relative edit distance* be the Levenshtein distance between $v$ and $w$ divided by the sum of the lengths of $v$ and $w$. Find two different words in `text2` with minimal relative edit distance.

In [23]:
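def relative_edit(a, b):
    return nltk.edit_distance(a, b) / (len(a) + len(b))

d = {i.lower() for i in text2 if i.lower() not in stop_words}
print(len(d))

def min_rel(words=d):
    # Brute force over all pairs of distinct words; slow for thousands of words.
    # Returns (relative distance, word1, word2) for the closest pair.
    return min((relative_edit(s1, s2), s1, s2) for s1, s2 in combinations(words, 2))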

6268


**29**

For a given bitstring **b** and a natural number *r*, list all bitstrings **b’** such that the Hamming distance between `b` and `b’` is equal to *r*.
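
A possible approach, sketched under the assumption that the bitstring is passed as a Python `str` of `0`/`1` characters (so leading zeros are preserved, unlike the integer input used in exercise 21): a string at Hamming distance exactly *r* differs from `b` in exactly *r* positions, so it is enough to choose every *r*-element subset of positions and flip those bits. The function name `bitstrings_at_distance` is illustrative.

```python
from itertools import combinations


def bitstrings_at_distance(b, r):
    """Return all bitstrings (as str) at Hamming distance exactly r from b."""
    n = len(b)
    if not 0 <= r <= n:
        return []  # no string of the same length can differ in r positions
    result = []
    for positions in combinations(range(n), r):
        bits = list(b)
        for p in positions:
            bits[p] = "1" if bits[p] == "0" else "0"  # flip the chosen bit
        result.append("".join(bits))
    return result


neighbours = bitstrings_at_distance("10101", 2)
```

For a string of length *n* this produces $\binom{n}{r}$ results; with `r = 1` it reduces to the listing from exercise 21.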

**30**

Construct a function that, for a given string and a natural number *k*, returns a **set** of all its *k*-shingles.
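
A minimal sketch, assuming that a *k*-shingle means a substring of *k* consecutive characters (word-level shingles would need the string to be tokenized first); the name `shingles` is illustrative.

```python
def shingles(text, k):
    """Return the set of all length-k substrings (character k-shingles) of text."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # A string shorter than k has no shingles, so the range below is empty.
    return {text[i:i + k] for i in range(len(text) - k + 1)}


example = shingles("abcdabd", 2)
```

Because the result is a set, the repeated shingle `ab` in `abcdabd` appears only once.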