# Spacy

# Home task

1. Using a Spacy, create a keywords extractor that should do the following things:
 - Take some text (article like) as an input.
 - Remove all stop words from the text.
 - Extract all the Nouns from text and sort them by count and return in descending order with amount of occurrences. 
 - Extract all the Verbs from text and sort them by count and return in descending order with amount of occurrences.  
 - Extract all the Numbers from text and sort them by count and return in descending order with amount of occurrences. 
 - Extract all the Named Entities from the text, group them into 4 groups (Location, Person, Organization, Misc.) and return groups in descending order with amount of occurrences. 


2. Using multilingual USE, align strings in English and Russian texts:
 - Download multilingual USE model - https://tfhub.dev/google/universal-sentence-encoder-multilingual/3
 - Read "./data/corpora/en.txt" and "./data/corpora/ru.txt" files
 - Align English strings with their Russian analogues using mUSE
 
 
3. Using the USE, create a Duplicate Phrase Finder that will do the following:
 - Take some large text as an input.
 - Separates text to SENTENCES (phrases). 
 - Finds semantically similar strings (cosine similarity >=0.80)

In [40]:
import spacy
import gzip
import gensim
nlp = spacy.load("en_core_web_lg")
from nltk.tokenize import word_tokenize
import pandas as pd
import numpy as np

In [2]:
text = '''
Apple Inc. is an American multinational technology company that specializes in consumer electronics, computer software, and online services. Apple is the world's largest technology company by revenue (totaling $274.5 billion in 2020) and, since January 2021, the world's most valuable company. As of 2021, Apple is the world's fourth-largest PC vendor by unit sales,[9] and fourth-largest smartphone manufacturer.[10][11] It is one of the Big Five American information technology companies, along with Amazon, Google, Microsoft, and Facebook.[12][13][14]

Apple was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in 1976 to develop and sell Wozniak's Apple I personal computer. It was incorporated by Jobs and Wozniak as Apple Computer, Inc. in 1977, and sales of its computers, including the Apple II, grew quickly. They went public in 1980 to instant financial success. Over the next few years, Apple shipped new computers featuring innovative graphical user interfaces, such as the original Macintosh, announced with the critically acclaimed advert "1984". However, the high price of its products and limited application library caused problems, as did power struggles between executives. In 1985, Wozniak departed Apple amicably,[15] while Jobs resigned to found NeXT, taking some Apple co-workers with him.[16]

As the market for personal computers expanded and evolved through the 1990s, Apple lost considerable market share to the lower-priced duopoly of Microsoft Windows on Intel PC clones. The board recruited CEO Gil Amelio, who prepared the struggling company for eventual success with extensive reforms, product focus and layoffs in his 500 day tenure. In 1997, Gil bought NeXT, to resolve Apple's unsuccessful OS strategy and bring back Steve Jobs, who replaced Amelio as CEO later that year. Apple returned to profitability under the revitalizing "Think different" campaign, launching the iMac and iPod, opening a retail chain of Apple Stores in 2001, and acquiring numerous companies to broaden their software portfolio. In 2007, the company launched the iPhone to critical acclaim and financial success. In 2011, Jobs resigned as CEO due to health complications, and died two months later. He was succeeded by Tim Cook.

In August 2018, Apple became the first publicly traded U.S. company to be valued at over $1 trillion[17][18] and the first valued over $2 trillion two years later.[19][20] It has a high level of brand loyalty and is ranked as the world's most valuable brand; as of January 2021, there are 1.65 billion Apple products in use worldwide.[21] However, the company receives significant criticism regarding the labor practices of its contractors, its environmental practices, and business ethics, including anti-competitive behavior, and materials sourcing.
'''

In [3]:
input_file = "./data/reviews_data.txt.gz"

with gzip.open(input_file, 'rb') as f:
    for i,line in enumerate (f):
        ii = str(line)
        break

In [4]:
def key_ext(text):
    all_stopwords = nlp.Defaults.stop_words
    text_tokens = nlp.tokenizer(text)
    text_t= ' '.join([str(word) for word in text_tokens if not word in all_stopwords])       

    doc = nlp(text_t)


    noun = []
    for token in doc:
        if token.pos_ == 'NOUN':
            noun.append(token)

    verb = []
    for token in doc:
        if token.pos_ == 'VERB':
            verb.append(token)

    number = []
    for token in doc:
        if token.pos_ == 'NUM':
            number.append(token)

    propn = []
    for token in doc:
        # print(token.text, token.pos_, token.dep_, token.head.text)
        if token.pos_ == 'PROPN':
            propn.append(token)
    list_ext = [noun,verb,number,propn]
    return list_ext

In [48]:
l = (key_ext(text))

In [51]:
z = list(map(list, zip(*l)))

In [52]:
# df1 = pd.DataFrame(key_ext(text))
df1 = pd.DataFrame(z, columns=['NOUN','VERB','NUM','PROPN'])


In [56]:
df2 = pd.DataFrame(l, ['NOUN','VERB','NUM','PROPN'])

In [57]:
df2

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,88,89,90,91,92,93,94,95,96,97
NOUN,technology,company,consumer,electronics,computer,software,services,world,technology,company,...,criticism,labor,practices,contractors,practices,business,ethics,behavior,materials,sourcing
VERB,specializes,totaling,founded,develop,sell,incorporated,including,grew,went,shipped,...,,,,,,,,,,
NUM,274.5,billion,2020,2021,2021,one,Five,1976,1977,1980,...,,,,,,,,,,
PROPN,Apple,Inc.,Apple,January,Apple,Amazon,Google,Microsoft,Facebook.[12][13][14,Apple,...,,,,,,,,,,


In [15]:
def show(text, i):
    for t in key_ext(text):
        print(len(t),t[:i])

In [8]:
show(text, 10)

98 [technology, company, consumer, electronics, computer, software, services, world, technology, company]
47 [specializes, totaling, founded, develop, sell, incorporated, including, grew, went, shipped]
26 [274.5, billion, 2020, 2021, 2021, one, Five, 1976, 1977, 1980]
62 [Apple, Inc., Apple, January, Apple, Amazon, Google, Microsoft, Facebook.[12][13][14, Apple]


In [None]:
show(ii, 10)

In [14]:
# print(spacy.explain("GPE"))


In [17]:
# Read "./data/corpora/en.txt" and "./data/corpora/ru.txt" files

en = []
ru = []
with open("./data/corpora/en.txt") as f:
    for line in f.readlines()[:50]:
        en.append(line.strip())
        
with open("./data/corpora/ru.txt") as f:
    for line in f.readlines()[:50]:
        ru.append(line.strip()) 

In [39]:
en

['How do you explain this progression?',
 "Cigarettes are linked to 85% of lung cancer cases, this massively damages people's health.",
 'Everything moves very fast in football',
 "You're never going to win 4-0 every weekend - we're not FC Barcelona!",
 'We got out of Afghanistan.',
 'French troops have left their area of responsibility in Afghanistan']

In [40]:
ru

['Курение связано с 85% случаев рака легких. Оно наносит колоссальный вред здоровью людей.',
 'В футболе все происходит очень быстро.',
 'Французские войска покинули свою зону ответственности в Афганистане',
 'Мы никогда не сможем выигрывать каждые выходные со счетом 4-0.',
 'Мы ушли из Афганистана.',
 'Как вы объясните этот рост?']