**Examples of English-language exercises**
- https://practicum.yandex.ru/english/app/cust-dev/promo/space-chapter-a
- https://practicum.yandex.ru/english/app/cust-dev/promo/space-chapter-b

**Useful resources**

How to use pretrained vectors in gensim, and its useful methods: most_similar, similarity, distance, doesnt_match
https://radimrehurek.com/gensim/models/keyedvectors.html
(spacy also has most_similar, but the gensim implementation is much simpler and easier to follow)

Which pretrained models and vectors gensim offers, and how to list and load them
https://radimrehurek.com/gensim/auto_examples/howtos/run_downloader_api.html
I suggest GloVe trained on Wikipedia

Which pretrained spacy models are available
https://spacy.io/models/en
The most popular one, en_core_web_sm, is small and lightweight, trained on web text, and includes a vocabulary, syntax, and named entities

Parts of speech in spacy: a detailed example
https://machinelearningknowledge.ai/tutorial-on-spacy-part-of-speech-pos-tagging/

List of part-of-speech tags
https://stackoverflow.com/questions/15388831/what-are-all-possible-pos-tags-of-nltk

All linguistic features of spacy
https://spacy.io/usage/linguistic-features
Parsing a sentence into subject/predicate/object/adverbial is covered in the Dependency Parsing section

A module for inflecting words (for example verb tenses, plurals)
https://pypi.org/project/pyinflect/
and how to make it work with spacy
https://spacy.io/universe/project/pyInflect


In [1]:
import json

import numpy as np
import pandas as pd

import spacy
import en_core_web_sm
import pyinflect

import gensim.downloader as api


In [2]:
# small spacy model
nlp = en_core_web_sm.load()

# small GloVe wiki model
# note: the download takes a very long time if the model is not installed yet
model = api.load("glove-wiki-gigaword-100")

In [13]:
# sample dataset: how to package the exercises

df = pd.DataFrame(columns=['raw', 'type', 'object', 'options', 'answer', 'description'])

df.loc[len(df)] = {'raw' : 'Once upon a time there was a young fellow who enlisted as a soldier, conducted himself bravely, and was always at the very front when it was raining bullets.',
                   'type' : 'select_word',
                   'object' : 'raining',
                   'options' : ['snowing', 'rained', 'raining'],
                   'answer' : 'raining',
                   'description' : 'Choose the word'
                  }

df.loc[len(df)] = {'raw' : 'His parents were dead, and he had no longer a home, so he went to his brothers and asked them to support him until there was another war.',
                   'type' : 'missing_word',
                   'object' : 'longer',
                   'options' : [],
                   'answer' : 'longer',
                   'description' : 'Fill in the blank'
                  }

df.loc[len(df)] = {'raw' : 'We have no work for you.',
                   'type' : 'select_sent',
                   'object' : 'We have no work for you.',
                   'options' : ['We have no work for you.',
                                'We had no works for you.',
                                'We been no done for you.'],
                   'answer' : 'We have no work for you.',
                   'description' : 'Which sentence is correct?'
                  }

df.loc[len(df)] = {'raw' : 'The poor bride-to-be dressed herself entirely in black, and when she thought about her future bridegroom, tears came into her eyes.',
                   'type' : 'noun_phrases',
                   'object' : 'her future bridegroom',
                   'options' : ['nominal subject (passive)',
                                'nominal subject',
                                'object of preposition'],
                   'answer' : 'object of preposition',
                   'description' : 'What is the highlighted phrase?'
                  }

# a sentence with no exercise attached: the remaining columns stay empty (NaN)
df.loc[len(df)] = {'raw' : 'Sentence without exercises',
                  }

df

Unnamed: 0,raw,type,object,options,answer,description
0,Once upon a time there was a young fellow who ...,select_word,raining,"[snowing, rained, raining]",raining,Choose the word
1,"His parents were dead, and he had no longer a ...",missing_word,longer,[],longer,Fill in the blank
2,We have no work for you.,select_sent,We have no work for you.,"[We have no work for you., We had no works for...",We have no work for you.,Which sentence is correct?
3,The poor bride-to-be dressed herself entirely ...,noun_phrases,her future bridegroom,"[nominal subject (passive), nominal subject, o...",object of preposition,What is the highlighted phrase?
4,Sentence without exercises,,,,,
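
A row packaged this way can be turned into the text a learner sees. The sketch below is only one possible rendering: the helper `render_exercise`, the `BLANK` marker and the output layout are hypothetical and not part of the notebook, and rows are plain dicts with the same keys as the DataFrame columns (the sample sentence is shortened).

```python
# Hypothetical helper: the name, the blank marker and the layout are
# assumptions made for illustration only.
BLANK = "_____"

def render_exercise(row):
    """Return the text shown to the learner for one exercise row."""
    kind = row.get("type")
    if not isinstance(kind, str):
        # Rows without an exercise (type missing or NaN) are shown unchanged.
        return row["raw"]
    lines = [row["description"]]
    if kind in ("select_word", "missing_word"):
        # Hide the first occurrence of the target word.
        lines.append(row["raw"].replace(row["object"], BLANK, 1))
    elif kind == "noun_phrases":
        # Mark the phrase the question is about.
        lines.append(row["raw"].replace(row["object"], "[" + row["object"] + "]", 1))
    # For select_sent nothing is added here: the options are the whole task.
    lines += [f"  {i}. {opt}" for i, opt in enumerate(row["options"], start=1)]
    return "\n".join(lines)

row = {
    "raw": "His parents were dead, and he had no longer a home.",
    "type": "missing_word",
    "object": "longer",
    "options": [],
    "answer": "longer",
    "description": "Fill in the blank",
}
prompt = render_exercise(row)
print(prompt)
```

The `isinstance` check is there because an empty `type` comes back from pandas as NaN rather than `None`.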


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   raw          5 non-null      object
 1   type         4 non-null      object
 2   object       4 non-null      object
 3   options      4 non-null      object
 4   answer       4 non-null      object
 5   description  4 non-null      object
dtypes: object(6)
memory usage: 280.0+ bytes


In [15]:
df.to_csv('sample_df.csv', index=False)
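
One thing to keep in mind with CSV: the lists in `options` are written as text such as `['snowing', 'rained', 'raining']`, so they come back as strings when the file is read again. A minimal sketch of the round trip, using only the standard library and an in-memory stand-in for `sample_df.csv` (the helper `parse_options` is hypothetical):

```python
import ast
import csv
import io

def parse_options(cell):
    """Turn the text stored in the 'options' column back into a list."""
    # Empty cells (rows without an exercise) become an empty list.
    return ast.literal_eval(cell) if cell else []

# A two-row stand-in for sample_df.csv with only the columns needed here.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["raw", "options"])
writer.writeheader()
writer.writerow({"raw": "it was raining bullets",
                 "options": str(["snowing", "rained", "raining"])})
writer.writerow({"raw": "Sentence without exercises", "options": ""})

# Read the rows back and restore the lists.
buffer.seek(0)
rows = [dict(r, options=parse_options(r["options"])) for r in csv.DictReader(buffer)]
print(type(rows[0]["options"]).__name__, rows[0]["options"])
```

With pandas the same function should be usable as `pd.read_csv('sample_df.csv', converters={'options': parse_options})`; this has not been checked against the notebook's data.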

# Similar words

In [4]:
# similar words, synonyms
model.similar_by_word('fast')

[('slow', 0.7959730625152588),
 ('faster', 0.7511823177337646),
 ('pace', 0.7462931871414185),
 ('speed', 0.7133392691612244),
 ('quick', 0.7107294797897339),
 ('easy', 0.6889646649360657),
 ('better', 0.6753882169723511),
 ('slower', 0.673578143119812),
 ('way', 0.6688982248306274),
 ('moving', 0.666520357131958)]

In [7]:
# antonyms: add a positive/negative pair of words with opposite meanings
model.most_similar(positive=['fast','bad'], negative=['good'])

[('slow', 0.7502553462982178),
 ('slower', 0.6295009851455688),
 ('faster', 0.6158817410469055),
 ('too', 0.5972148180007935),
 ('turning', 0.5882929563522339),
 ('off', 0.5874745845794678),
 ('dangerous', 0.5860161185264587),
 ('worse', 0.5812638998031616),
 ('trouble', 0.5808587074279785),
 ('heavy', 0.5680885910987854)]

In [9]:
# filter out stop words with spacy
word = 'fast'
antonyms = model.most_similar(positive=[word, 'bad'], negative=['good'])
# keep only the words from the (word, score) tuples
antonyms = [w for w, _ in antonyms]
# drop stop words
antonyms = [token.text for token in nlp(' '.join(antonyms)) if not token.is_stop]
print('Potential antonyms of', word)
antonyms

Potential antonyms of fast


['slow',
 'slower',
 'faster',
 'turning',
 'dangerous',
 'worse',
 'trouble',
 'heavy']
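
A candidate list like this can feed the `options` column of a `select_word` exercise. The sketch below is one way to do it under stated assumptions: `make_select_word` is a hypothetical helper, the sentence is made up, and the candidate list is copied from the output above. Distractors are drawn at random, so a seed is passed to make the result repeatable.

```python
import random

def make_select_word(sentence, target, candidates, k=2, seed=None):
    """Build a select_word row: the target plus k random distractors."""
    rng = random.Random(seed)
    # Remove duplicates and the target itself from the candidate pool.
    pool = [w for w in dict.fromkeys(candidates) if w.lower() != target.lower()]
    distractors = rng.sample(pool, min(k, len(pool)))
    options = distractors + [target]
    rng.shuffle(options)
    return {
        "raw": sentence,
        "type": "select_word",
        "object": target,
        "options": options,
        "answer": target,
        "description": "Choose the word",
    }

candidates = ["slow", "slower", "faster", "turning",
              "dangerous", "worse", "trouble", "heavy"]
exercise = make_select_word("The fast train left early.", "fast", candidates, seed=0)
print(exercise["options"])
```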

In [11]:
# check the similarity of two words with gensim
for ant in antonyms:
    print(ant, model.similarity('fast', ant))


slow 0.79597306
slower 0.6735782
faster 0.7511823
turning 0.64884084
dangerous 0.55738723
worse 0.49056715
trouble 0.5541388
heavy 0.5079474


In [144]:
# how to compute similarity by hand:
# take the word vectors from gensim and compute the cosine similarity (cosine of the angle between them)
fast_vec = model['fast']
slow_vec = model['slow']
cosine_similarity = (fast_vec @ slow_vec)/(np.linalg.norm(fast_vec)*np.linalg.norm(slow_vec))
cosine_similarity


0.79597306
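
The same arithmetic without numpy or gensim, plus two related quantities: cosine distance (taken here as one minus the similarity) and an "odd one out" pick in the spirit of doesnt_match. This is a sketch, not gensim's implementation; the three-dimensional vectors are invented for illustration and are not GloVe values.

```python
import math

def cosine_similarity(u, v):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """One minus the cosine similarity."""
    return 1.0 - cosine_similarity(u, v)

def odd_one_out(vectors):
    """Return the word least similar to the mean of the unit-length vectors."""
    unit = {}
    for word, vec in vectors.items():
        length = math.sqrt(sum(x * x for x in vec))
        unit[word] = [x / length for x in vec]
    mean = [sum(column) / len(unit) for column in zip(*unit.values())]
    return min(unit, key=lambda word: cosine_similarity(unit[word], mean))

# Toy vectors (assumed values, chosen so that 'pizza' points elsewhere).
toy = {
    "fast": [0.9, 0.1, 0.0],
    "quick": [0.8, 0.2, 0.1],
    "slow": [0.7, 0.3, 0.0],
    "pizza": [0.0, 0.1, 0.9],
}
print(odd_one_out(toy))
print(cosine_distance(toy["fast"], toy["pizza"]) > cosine_distance(toy["fast"], toy["slow"]))
```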

# Sentence transformation

In [18]:
# replace nouns, verbs, adverbs and adjectives
# with random similar words and "anti-words"
sent = 'Where are you going so early, Little Red Cap?'
new_sent_1, new_sent_2 = sent, sent
i=5
for token in nlp(sent):
    if token.pos_ in ['NOUN', 'VERB', 'ADV', 'ADJ']:
        # sometimes the word is not in the vocabulary, hence the try-except
        try:
            m, n = np.random.randint(0, i, 2)
            
            new_word_1 = model.most_similar(token.text.lower(), topn=i)[m][0]
            new_word_2 = model.most_similar(positive = [token.text.lower(), 'bad'],
                                            negative = ['good'],
                                            topn=i)[n][0]

            new_word_1 = new_word_1.title() if token.text.istitle() else new_word_1
            new_word_2 = new_word_2.title() if token.text.istitle() else new_word_2
            
            new_sent_1 = new_sent_1.replace(token.text, new_word_1)
            new_sent_2 = new_sent_2.replace(token.text, new_word_2)
        except KeyError:  # the word is missing from the model vocabulary
            pass

print(sent)
print(new_sent_1)
print(new_sent_2)


Where are you going so early, Little Red Cap?
Where are you n't but during, Bit Red Cap?
Where are you gone because late, Big Red Cap?
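
The loop above substitutes words with `str.replace`, which also changes every other occurrence of the same characters, including those inside longer words. A minimal sketch of a whole-word replacement with the standard `re` module (the helper `replace_word` and the sample sentence are hypothetical):

```python
import re

def replace_word(sentence, old, new):
    """Replace `old` only where it stands as a whole word."""
    pattern = r"\b" + re.escape(old) + r"\b"
    # A function is used as the replacement so that backslashes
    # in `new` are not treated as escape sequences.
    return re.sub(pattern, lambda match: new, sentence)

sentence = "He is so early, also."
print(sentence.replace("so", "very"))        # changes 'also' as well
print(replace_word(sentence, "so", "very"))  # leaves 'also' untouched
```

Replacing by token position (for example via `token.idx` in spacy) would be more precise still; the regular expression is just the smallest change to the existing loop.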


# Inflecting

In [19]:
# change the degree of an adjective with pyinflect
for token in nlp("I think it's a good idea and easy to use"):
    if token.pos_=='ADJ':
        print(token.text, token._.inflect('JJS'))          

good best
easy easiest


In [116]:
# change the verb form with pyinflect
for token in nlp("I think it's a good idea and easy to use"):
    if token.pos_=='VERB':
        print(token._.inflect('VBP'))
        print(token._.inflect('VBZ'))
        print(token._.inflect('VBG'))
        print(token._.inflect('VBD'))


think
thinks
thinking
thought
use
uses
using
used


# Morphology

In [83]:
# morphology: parts of speech and word forms
for token in nlp("I think it's a good idea and easy to use"):
    print(token.text, '\t–\t', token.morph) 


I 	–	 Case=Nom|Number=Sing|Person=1|PronType=Prs
think 	–	 Tense=Pres|VerbForm=Fin
it 	–	 Case=Nom|Gender=Neut|Number=Sing|Person=3|PronType=Prs
's 	–	 Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
a 	–	 Definite=Ind|PronType=Art
good 	–	 Degree=Pos
idea 	–	 Number=Sing
and 	–	 ConjType=Cmp
easy 	–	 Degree=Pos
to 	–	 
use 	–	 VerbForm=Inf
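
The printed morphology is a string of `Feature=Value` pairs separated by `|`, which is easy to turn into a dict for filtering, for example to pick present-tense finite verbs for a tense exercise. The sketch below parses the printed form only; `parse_morph` and `is_present_finite` are hypothetical helpers. On a live token, spacy's `token.morph.to_dict()` should give the same mapping directly.

```python
def parse_morph(morph):
    """Parse 'Tense=Pres|VerbForm=Fin' into {'Tense': 'Pres', 'VerbForm': 'Fin'}."""
    if not morph:
        # Tokens such as 'to' have no features at all.
        return {}
    return dict(pair.split("=", 1) for pair in morph.split("|"))

def is_present_finite(morph):
    """True for features like those of 'think' in the output above."""
    features = parse_morph(morph)
    return features.get("Tense") == "Pres" and features.get("VerbForm") == "Fin"

print(parse_morph("Case=Nom|Number=Sing|Person=1|PronType=Prs"))
print(is_present_finite("Tense=Pres|VerbForm=Fin"))
print(is_present_finite("VerbForm=Inf"))
```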


# Dependency

In [87]:
# dependency parse of a sentence
for token in nlp("All the necessary ingredients for a pizza arrived in the next delivery"):
    print(token.text, ':', token.dep_)


All : predet
the : det
necessary : amod
ingredients : nsubj
for : prep
a : det
pizza : pobj
arrived : ROOT
in : prep
the : det
next : amod
delivery : pobj


In [20]:
# nouns together with their dependent words (noun chunks)
for chunk in nlp("All the necessary ingredients for a pizza arrived in the next delivery").noun_chunks:
    print(chunk.text, ':', 
          chunk.root.text, ':', 
          chunk.root.dep_, len(chunk), ':', 
          spacy.explain(chunk.root.dep_), ':', 
          chunk.root.head.text)


All the necessary ingredients : ingredients : nsubj 4 : nominal subject : arrived
a pizza : pizza : pobj 2 : object of preposition : for
the next delivery : delivery : pobj 3 : object of preposition : in
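
The chunk text and the explained dependency label are enough to build `noun_phrases` rows like the one in the sample dataset. A minimal sketch, assuming the chunks have already been extracted into (text, label) pairs as printed above; `make_noun_phrase_exercises` is a hypothetical helper and the label pool is the one used in the sample dataset.

```python
def make_noun_phrase_exercises(sentence, chunks, label_pool):
    """Create one noun_phrases row per (chunk text, label) pair."""
    exercises = []
    for text, label in chunks:
        # Offer every label from the pool and make sure the answer is included.
        options = list(dict.fromkeys(list(label_pool) + [label]))
        exercises.append({
            "raw": sentence,
            "type": "noun_phrases",
            "object": text,
            "options": options,
            "answer": label,
            "description": "What is the highlighted phrase?",
        })
    return exercises

sentence = "All the necessary ingredients for a pizza arrived in the next delivery"
chunks = [
    ("All the necessary ingredients", "nominal subject"),
    ("a pizza", "object of preposition"),
    ("the next delivery", "object of preposition"),
]
label_pool = ["nominal subject (passive)", "nominal subject", "object of preposition"]

exercises = make_noun_phrase_exercises(sentence, chunks, label_pool)
for exercise in exercises:
    print(exercise["object"], "->", exercise["answer"])
```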


# Gensim models and vectors


In [10]:
# what else is worth exploring in gensim
info = api.info()
print(json.dumps(info, indent=4))


{
    "corpora": {
        "semeval-2016-2017-task3-subtaskBC": {
            "num_records": -1,
            "record_format": "dict",
            "file_size": 6344358,
            "reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/semeval-2016-2017-task3-subtaskB-eng/__init__.py",
            "license": "All files released for the task are free for general research use",
            "fields": {
                "2016-train": [
                    "..."
                ],
                "2016-dev": [
                    "..."
                ],
                "2017-test": [
                    "..."
                ],
                "2016-test": [
                    "..."
                ]
            },
            "description": "SemEval 2016 / 2017 Task 3 Subtask B and C datasets contain train+development (317 original questions, 3,169 related questions, and 31,690 comments), and test datasets in English. The description of the tasks and the collect

In [90]:
# for example, models trained on different sources
for model_name, model_data in sorted(info['models'].items()):
    print(
        '%s (%d records): %s' % (
            model_name,
            model_data.get('num_records', -1),
            model_data['description'][:40] + '...',
        )
    )


__testing_word2vec-matrix-synopsis (-1 records): [THIS IS ONLY FOR TESTING] Word vecrors ...
conceptnet-numberbatch-17-06-300 (1917247 records): ConceptNet Numberbatch consists of state...
fasttext-wiki-news-subwords-300 (999999 records): 1 million word vectors trained on Wikipe...
glove-twitter-100 (1193514 records): Pre-trained vectors based on  2B tweets,...
glove-twitter-200 (1193514 records): Pre-trained vectors based on 2B tweets, ...
glove-twitter-25 (1193514 records): Pre-trained vectors based on 2B tweets, ...
glove-twitter-50 (1193514 records): Pre-trained vectors based on 2B tweets, ...
glove-wiki-gigaword-100 (400000 records): Pre-trained vectors based on Wikipedia 2...
glove-wiki-gigaword-200 (400000 records): Pre-trained vectors based on Wikipedia 2...
glove-wiki-gigaword-300 (400000 records): Pre-trained vectors based on Wikipedia 2...
glove-wiki-gigaword-50 (400000 records): Pre-trained vectors based on Wikipedia 2...
word2vec-google-news-300 (3000000 records): Pre-trai
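
To narrow this list down, the model names can be filtered before anything is downloaded. The sketch below works on a hand-copied subset of the names and record counts printed above instead of calling the downloader; `vector_size` and `pick_models` are hypothetical helpers, and reading the trailing number of a name as the vector dimensionality is an assumption based on the naming pattern.

```python
# Subset of the listing above, copied by hand (name -> metadata).
models = {
    "__testing_word2vec-matrix-synopsis": {"num_records": -1},
    "glove-twitter-25": {"num_records": 1193514},
    "glove-wiki-gigaword-50": {"num_records": 400000},
    "glove-wiki-gigaword-100": {"num_records": 400000},
    "glove-wiki-gigaword-300": {"num_records": 400000},
    "word2vec-google-news-300": {"num_records": 3000000},
}

def vector_size(name):
    """Trailing number of the model name, or None if there is none."""
    tail = name.rsplit("-", 1)[-1]
    return int(tail) if tail.isdigit() else None

def pick_models(models, prefix, max_size):
    """Names starting with `prefix` whose size is at most `max_size`, smallest first."""
    names = [
        name for name in models
        if name.startswith(prefix)
        and vector_size(name) is not None
        and vector_size(name) <= max_size
    ]
    return sorted(names, key=vector_size)

print(pick_models(models, "glove-wiki", 100))
```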

In [11]:
# the same listing for the available corpora
for corpus_name, corpus_data in sorted(info['corpora'].items()):
    print(
        '%s (%d records): %s' % (
            corpus_name,
            corpus_data.get('num_records', -1),
            corpus_data['description'][:40] + '...',
        )
    )


20-newsgroups (18846 records): The notorious collection of approximatel...
__testing_matrix-synopsis (-1 records): [THIS IS ONLY FOR TESTING] Synopsis of t...
__testing_multipart-matrix-synopsis (-1 records): [THIS IS ONLY FOR TESTING] Synopsis of t...
fake-news (12999 records): News dataset, contains text and metadata...
patent-2017 (353197 records): Patent Grant Full Text. Contains the ful...
quora-duplicate-questions (404290 records): Over 400,000 lines of potential question...
semeval-2016-2017-task3-subtaskA-unannotated (189941 records): SemEval 2016 / 2017 Task 3 Subtask A una...
semeval-2016-2017-task3-subtaskBC (-1 records): SemEval 2016 / 2017 Task 3 Subtask B and...
text8 (1701 records): First 100,000,000 bytes of plain text fr...
wiki-english-20171001 (4924894 records): Extracted Wikipedia dump from October 20...
