1. Using spaCy, create a keyword extractor that does the following:
 - Takes some article-like text as input.
 - Removes all stop words from the text.
 - Extracts all nouns from the text and returns them with their number of occurrences, sorted by count in descending order.
 - Extracts all verbs from the text and returns them with their number of occurrences, sorted by count in descending order.
 - Extracts all numbers from the text and returns them with their number of occurrences, sorted by count in descending order.
 - Extracts all named entities from the text, groups them into four groups (Location, Person, Organization, Misc.), and returns the groups with their number of occurrences, in descending order.


2. Using the multilingual Universal Sentence Encoder (mUSE), align strings in English and Russian texts:
 - Download the multilingual USE model: https://tfhub.dev/google/universal-sentence-encoder-multilingual/3
 - Read the "./data/corpora/en.txt" and "./data/corpora/ru.txt" files.
 - Align the English strings with their Russian counterparts using mUSE.
 
 
3. Using USE, create a Duplicate Phrase Finder that does the following:
 - Takes some large text as input.
 - Splits the text into sentences (phrases).
 - Finds semantically similar sentences (cosine similarity >= 0.80).

### Task 1

In [35]:
import spacy

nlp = spacy.load("en_core_web_lg")

In [5]:
def get_document(input_file):
    """Read a text file and print its first 500 characters as a preview."""
    with open(input_file, encoding='utf-8') as f:
        article = f.read()

    print(f'Article: \n {article[:500]} \n')
    return article


input_file = 'data/golf_article.txt'
doc = nlp(get_document(input_file))

In [59]:
def remove_stop_words(doc):
    all_stopwords = nlp.Defaults.stop_words
    doc_length_before_cleaning = len(doc)
    doc_without_stop_words = [token for token in doc if token.text not in all_stopwords]
    print('Amount of tokens before and after removing stop words: {}/{}'
          .format(doc_length_before_cleaning, len(doc_without_stop_words)))
    # Return the filtered tokens, not the untouched Doc.
    return doc_without_stop_words


# Keep `doc` as a spaCy Doc: the cells below still need doc.ents.
tokens_without_stop_words = remove_stop_words(doc)

Amount of tokens before and after removing stop words: 1069/634
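
The membership test above compares the raw token text, so a capitalised stop word such as "The" is not matched against the lowercase entries of the stop-word set. A minimal sketch of a case-insensitive filter follows; the small `STOP_WORDS` set and the `drop_stop_words` helper are hypothetical stand-ins for illustration, not part of the notebook. In spaCy itself, the `token.is_stop` attribute may be the simpler route.

```python
# Hypothetical stand-in for nlp.Defaults.stop_words, which stores lowercase entries.
STOP_WORDS = {"the", "a", "is", "in", "of"}


def drop_stop_words(words, stop_words=STOP_WORDS):
    """Return the words whose lowercase form is not a stop word."""
    return [word for word in words if word.lower() not in stop_words]


sample = ["The", "ball", "is", "in", "the", "hole"]
kept = drop_stop_words(sample)
print(kept)
```

Lowercasing only for the comparison keeps the original casing of the surviving words, which matters for the named-entity step later on.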


In [60]:
from collections import Counter

def get_pos(doc):
    nouns_list = []
    numbers_list = []
    verbs_list = []
    
    for token in doc:
        if token.pos_=='NOUN':
            nouns_list.append(token.lemma_)
        elif token.pos_=='NUM':
            numbers_list.append(token.lemma_)
        elif token.pos_=='VERB':
            verbs_list.append(token.lemma_)
            
    print('Most frequent nouns:\n {} \n'.format(Counter(nouns_list).most_common(10)))
    print('Most frequent numbers:\n {} \n'.format(Counter(numbers_list).most_common(10)))
    print('Most frequent verbs:\n {} \n'.format(Counter(verbs_list).most_common(10)))

In [61]:
get_pos(doc)

Most frequent nouns:
 [('golf', 28), ('ball', 14), ('hole', 14), ('player', 13), ('stroke', 11), ('golfer', 11), ('club', 10), ('par', 7), ('course', 6), ('number', 5)] 

Most frequent numbers:
 [('one', 6), ('two', 4), ('18', 3), ('500', 2), ('hundred', 1), ('1.62', 1), ('46', 1), ('14', 1), ('1', 1), ('9', 1)] 

Most frequent verbs:
 [('have', 13), ('play', 12), ('hit', 7), ('use', 7), ('make', 6), ('be', 5), ('call', 5), ('start', 5), ('finish', 3), ('take', 3)] 
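
The task asks for the counts to be returned rather than printed. A minimal sketch of that idea, assuming the tokens have already been reduced to `(pos, lemma)` pairs; the name `count_lemmas_by_pos` and the sample pairs are illustrative only and do not come from the notebook.

```python
from collections import Counter


def count_lemmas_by_pos(tagged_lemmas, wanted=("NOUN", "VERB", "NUM")):
    """Group (pos, lemma) pairs by tag and return {pos: [(lemma, count), ...]},
    each list sorted by descending count."""
    counters = {pos: Counter() for pos in wanted}
    for pos, lemma in tagged_lemmas:
        if pos in counters:
            counters[pos][lemma] += 1
    # most_common() with no argument returns every entry, highest count first.
    return {pos: counter.most_common() for pos, counter in counters.items()}


# With spaCy, the pairs could come from: [(t.pos_, t.lemma_) for t in doc if not t.is_stop]
pairs = [("NOUN", "golf"), ("VERB", "play"), ("NOUN", "golf"),
         ("NUM", "18"), ("NOUN", "ball"), ("ADJ", "green")]
pos_counts = count_lemmas_by_pos(pairs)
print(pos_counts["NOUN"])
```

Returning a dictionary keeps the three lists separate while letting the caller decide how many entries to display.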



In [129]:
def get_named_entities(doc):
    """Print the most common entities in four groups: ORG, PERSON, LOCATION, MISC."""
    label_dict = {'ORG': [],
                  'PERSON': [],
                  'LOCATION': [],
                  'MISC': []}
    # spaCy labels that count as locations
    location_labels = ['LOC', 'GPE']

    for ent in doc.ents:
        if ent.label_ in label_dict:
            label_dict[ent.label_].append(ent.text)
        elif ent.label_ in location_labels:
            label_dict['LOCATION'].append(ent.text)
        else:
            # Everything else (CARDINAL, ORDINAL, NORP, ...) goes to MISC
            label_dict['MISC'].append(ent.text)

    for group, entities in label_dict.items():
        print(f'{group}: \n {Counter(entities).most_common(10)}\n')


get_named_entities(doc)

ORG: 
 [('Golf\n \nGolf', 1), ('Golf Equipment', 1), ('Par', 1), ('Eagle', 1), ('Royal  St. Andrews Golf Club', 1), ('Royal St. Andrews Golf Club - Gordon McKinlay\n \n', 1), ('PGA', 1), ('Severiano Ballesteros', 1)]

PERSON: 
 [('Birdie', 1), ('Saint Andrews', 1), ('Bernhard Langer', 1), ('Gary Player', 1), ('Tiger Woods', 1)]

LOCATION: 
 [('Scotland', 1), ('Britain', 1), ('Calcutta', 1), ('India', 1), ('USA', 1), ('Canada', 1), ('Spain', 1), ('Germany', 1), ('South Africa', 1), ('Europe', 1)]

MISC: 
 [('first', 4), ('18', 3), ('two', 3), ('one', 3), ('Today', 2), ('British', 2), ('American', 2), ('Millions', 1), ('thousands', 1), ('millions', 1)]
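
The task also asks for the groups themselves to be ordered by how many entities they contain. A minimal sketch of that ordering, assuming the entities are available as `(text, label)` pairs; `GROUP_OF_LABEL`, `group_entities`, and the sample data are hypothetical names introduced for illustration.

```python
from collections import Counter

# spaCy label -> task group; every label not listed here falls into "Misc".
GROUP_OF_LABEL = {"GPE": "Location", "LOC": "Location",
                  "PERSON": "Person", "ORG": "Organization"}


def group_entities(labelled_entities):
    """Return [(group, total, [(entity, count), ...])], groups ordered by descending total."""
    groups = {name: Counter() for name in ("Location", "Person", "Organization", "Misc")}
    for text, label in labelled_entities:
        groups[GROUP_OF_LABEL.get(label, "Misc")][text] += 1
    ranked = sorted(groups.items(), key=lambda item: sum(item[1].values()), reverse=True)
    return [(name, sum(counter.values()), counter.most_common()) for name, counter in ranked]


# With spaCy, the pairs could come from: [(ent.text, ent.label_) for ent in doc.ents]
entities = [("Scotland", "GPE"), ("Tiger Woods", "PERSON"), ("Scotland", "GPE"),
            ("PGA", "ORG"), ("first", "ORDINAL"), ("Europe", "LOC")]
grouped = group_entities(entities)
print(grouped[0])
```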



In [102]:
# Labels of the first 10 entities
for ent in doc.ents[:10]:
    print(ent.label_)

ORG
CARDINAL
CARDINAL
CARDINAL
CARDINAL
CARDINAL
CARDINAL
ORDINAL
ORG
QUANTITY


### Task 2

In [1]:
import numpy as np
import tensorflow_hub as hub
# Imported for its side effect: it registers the text ops the multilingual USE model needs.
import tensorflow_text

In [2]:
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")

In [10]:
# Read the first 50 lines of each corpus; UTF-8 is set explicitly for the Russian text.
with open("./data/corpora/en.txt", encoding="utf-8") as f:
    en = [line.strip() for line in f.readlines()[:50]]

with open("./data/corpora/ru.txt", encoding="utf-8") as f:
    ru = [line.strip() for line in f.readlines()[:50]]

In [11]:
en, ru

(['How do you explain this progression?',
  "Cigarettes are linked to 85% of lung cancer cases, this massively damages people's health.",
  'Everything moves very fast in football',
  "You're never going to win 4-0 every weekend - we're not FC Barcelona!",
  'We got out of Afghanistan.',
  'French troops have left their area of responsibility in Afghanistan'],
 ['Курение связано с 85% случаев рака легких. Оно наносит колоссальный вред здоровью людей.',
  'В футболе все происходит очень быстро.',
  'Французские войска покинули свою зону ответственности в Афганистане',
  'Мы никогда не сможем выигрывать каждые выходные со счетом 4-0.',
  'Мы ушли из Афганистана.',
  'Как вы объясните этот рост?'])

In [44]:
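def get_similarity_of_sentence(sent1, sent2):
    """Embed two sentences with mUSE and return the inner product of their embeddings."""
    # USE embeddings are approximately unit length, so the inner product
    # approximates the cosine similarity.
    emb1, emb2 = embed([sent1, sent2])
    return np.inner(emb1, emb2)


# Greedy alignment: whenever ru[j] matches en[i], swap it into position i.
for i, en_sentence in enumerate(en):
    for j, ru_sentence in enumerate(ru):
        similarity = get_similarity_of_sentence(en_sentence, ru_sentence)
        if (similarity > .6) and (i != j):
            ru[i], ru[j] = ru[j], ru[i]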

In [13]:
en, ru

(['How do you explain this progression?',
  "Cigarettes are linked to 85% of lung cancer cases, this massively damages people's health.",
  'Everything moves very fast in football',
  "You're never going to win 4-0 every weekend - we're not FC Barcelona!",
  'We got out of Afghanistan.',
  'French troops have left their area of responsibility in Afghanistan'],
 ['Как вы объясните этот рост?',
  'Курение связано с 85% случаев рака легких. Оно наносит колоссальный вред здоровью людей.',
  'В футболе все происходит очень быстро.',
  'Мы никогда не сможем выигрывать каждые выходные со счетом 4-0.',
  'Мы ушли из Афганистана.',
  'Французские войска покинули свою зону ответственности в Афганистане'])
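
The pairwise loop above calls the model once per sentence pair. An alternative is to embed each corpus once and align through a similarity matrix. The sketch below is one way to do that under stated assumptions: `align_by_similarity` is a hypothetical helper, and the toy 2-D vectors stand in for the real embeddings (for example `embed(en).numpy()` and `embed(ru).numpy()`).

```python
import numpy as np


def align_by_similarity(source_emb, target_emb):
    """For each source row, return the index of the most similar target row
    together with the cosine similarity of that match."""
    source = np.asarray(source_emb, dtype=float)
    target = np.asarray(target_emb, dtype=float)
    # Normalise rows so the dot product is exactly the cosine similarity.
    source = source / np.linalg.norm(source, axis=1, keepdims=True)
    target = target / np.linalg.norm(target, axis=1, keepdims=True)
    similarity = source @ target.T  # shape: (n_source, n_target)
    best = similarity.argmax(axis=1)
    return best, similarity[np.arange(len(best)), best]


# Toy vectors standing in for the sentence embeddings of the two corpora.
en_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ru_vectors = [[0.9, 1.1], [2.0, 0.1], [0.1, 3.0]]
best, scores = align_by_similarity(en_vectors, ru_vectors)
print(best.tolist())
```

With real data, the aligned list could then be built as `[ru[j] for j in best]`. Note that a plain argmax does not enforce a one-to-one mapping, so two English sentences may select the same Russian sentence.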

### Task 3

In [12]:
# sent_tokenize relies on NLTK's Punkt sentence tokenizer data being installed.
from nltk.tokenize import sent_tokenize

In [38]:
input_file = 'data/Sir-Edwin-Landseer-Frederick-G--St.txt'
doc = get_document(input_file)
doc_sentences = sent_tokenize(doc)
doc_sentences[-1]

Article: 
 So much of the family history of this artist as it is needful to repeat,
or the reader will care to learn, may be briefly told: it begins with
his grandfather, who was a jeweller settled in London, where, in
1761,[2] his father, John Landseer, was born. The senior was on intimate
terms with Peter, father of the lawyer and politician, Sir Samuel
Romilly. Peter Romilly was descended from a distinguished French family,
the first of whom known in this country settled near London after the
revocation 



'It would have been better for\\nReynolds’s reputation if he had restricted\nhimself to that mode of art\\nin which he was a master.'

In [53]:
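def DuplicatePhraseFinder(doc):
    """Print every ordered pair of distinct sentences whose similarity is >= 0.80."""
    doc_sentences = sent_tokenize(doc)

    for i, sent1 in enumerate(doc_sentences):
        # Skip fragments of 5 characters or fewer.
        if len(sent1) <= 5:
            continue
        for j, sent2 in enumerate(doc_sentences):
            # Check i != j first so a sentence is never embedded against itself.
            if (i != j) and (get_similarity_of_sentence(sent1, sent2) >= .8):
                print(f'id: {i}, Sentence: ({sent1})\nSimilar to \nid:{j} Sentence: ({sent2})\n\n')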

In [54]:
DuplicatePhraseFinder(doc)

id: 30, Sentence: (It would have been better for
Reynolds’s reputation if he had restricted himself to that mode of art
in which he was a master.)
Similar to 
id:40 Sentence: (It would have been better for\nReynolds’s reputation if he had restricted
himself to that mode of art\nin which he was a master.)


id: 40, Sentence: (It would have been better for\nReynolds’s reputation if he had restricted
himself to that mode of art\nin which he was a master.)
Similar to 
id:30 Sentence: (It would have been better for
Reynolds’s reputation if he had restricted himself to that mode of art
in which he was a master.)
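
Each duplicate is reported twice above, once in each direction, and the model is called for every sentence pair. A possible alternative is to embed all sentences once and scan only the upper triangle of the similarity matrix. This is a sketch under assumptions: `find_similar_pairs` is a hypothetical helper, and the toy vectors stand in for the real embeddings (for example `embed(doc_sentences).numpy()`).

```python
import numpy as np


def find_similar_pairs(embeddings, threshold=0.80):
    """Return (i, j, similarity) for every pair i < j with cosine similarity >= threshold."""
    vectors = np.asarray(embeddings, dtype=float)
    vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    similarity = vectors @ vectors.T
    # k=1 selects the strict upper triangle: no self-matches, each pair reported once.
    rows, cols = np.triu_indices(len(vectors), k=1)
    keep = similarity[rows, cols] >= threshold
    return [(int(i), int(j), float(similarity[i, j]))
            for i, j in zip(rows[keep], cols[keep])]


# Toy vectors standing in for the sentence embeddings.
sentence_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
similar_pairs = find_similar_pairs(sentence_vectors)
print(similar_pairs)
```

The returned indices can be used to look up the sentence text, for instance `doc_sentences[i]` and `doc_sentences[j]`.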


