# Practical assignment, module 18, topic 5
# NLP

#### Objectives

* Learn the TfidfVectorizer interface.
* Practice text tokenization.
* Practice stop-word removal and text lemmatization.
* Learn to vectorize text.

You do not need to submit the completed work for review.  

The lesson comes with the file `vacansies.csv`, which contains job vacancy descriptions from one large IT company. Download it to your computer.

### What to do
The file `vacansies.csv` contains hundreds of vacancies, and some of them are similar. Find the vacancy that has the largest number of similar vacancies. Treat two vacancies as similar if the cosine distance between the vectors representing their texts is less than `0.5`.
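
As a reminder of what the similarity criterion measures, here is a minimal sketch of cosine distance (one minus the cosine of the angle between two vectors) in plain Python. The helper `cosine_distance` and the toy vectors are illustrative assumptions, not part of the assignment; in the assignment itself the distances come from scikit-learn.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v) for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Toy vectors (hypothetical data)
a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]  # same direction as a, so the distance is about 0
c = [0.0, 0.0, 3.0]  # orthogonal to a, so the distance is 1

d_ab = cosine_distance(a, b)
d_ac = cosine_distance(a, c)

# Under the assignment's rule, a and b count as similar, a and c do not
print(d_ab < 0.5, d_ac < 0.5)
```

Only the direction of the vectors matters, not their length, so a long and a short vacancy with the same word proportions end up close to each other.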

##### Step 1
Read the file `vacansies.csv` into a pandas DataFrame `df`.

##### Step 2
Write a function `preprocess` that:
1. Takes the text of a vacancy description as its argument.
1. Tokenizes it (here a token is a single word). Note that vacancy descriptions are more complex than the texts you worked with in the module. They contain punctuation, line breaks, emoji, and so on, so simply calling [split](https://docs.python.org/3/library/stdtypes.html#str.split) will not work. We recommend [RegexpTokenizer](https://www.nltk.org/api/nltk.tokenize.RegexpTokenizer.html) from NLTK, which tokenizes text with a regular expression: everything that matches it is treated as a token.
1. Removes stop words from the set of tokens (words).
1. Reduces each token (word) to its normal form (lemma). In the module you used the Porter stemmer from NLTK; this time, for variety, try [MorphAnalyzer](https://pymorphy2.readthedocs.io/en/stable/misc/api_reference.html#pymorphy2.analyzer.MorphAnalyzer) from pymorphy2. That library has not been updated for a long time but is still widely used. Use this [method for installing old versions](https://colab.research.google.com/corgiredirector?site=https%3A%2F%2Fcodingwithfun.com%2Fpip%2Fpymorphy2%2F528998%2F).
1. Returns the preprocessed text, which consists of tokens (words) in normal form and contains no stop words.
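
The difference between `split` and regex-based tokenization can be seen on a short made-up snippet. The sketch below uses `re.findall` from the standard library as a stand-in; `RegexpTokenizer(r'\w+').tokenize(text)` from NLTK is expected to produce the same tokens. The sample text is a hypothetical example, not a row from the dataset.

```python
import re

# Hypothetical vacancy fragment with punctuation, an emoji and a blank line
text = "Python-разработчик в Яндекс.Лавку🍔\n\nЗадачи: API, тесты."

# Naive whitespace split: punctuation and emoji stay glued to the words
naive = text.split()

# Regex tokenization: every run of word characters is one token
tokens = re.findall(r"\w+", text.lower())

print(naive)
print(tokens)
```

With the `\w+` pattern, punctuation, line breaks, and emoji are dropped automatically, so no manual `replace` calls are needed afterwards.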

##### Step 3
Create an instance of [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) and pass it your `preprocess` function through the `preprocessor` parameter. Do not use normalization (`norm=None`).
##### Step 4
Pass the vacancy texts from the dataframe `df` through TfidfVectorizer and then, as we did in the module, build a dataframe `result` from what the vectorizer returns. If everything is done correctly, the columns of this dataframe are words, the rows are documents, and the cell values are the TF-IDF score of the given word in the given document.
##### Step 5
Compute the pairwise cosine distances between all vectors in the corpus. Use the [cosine_distances](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_distances.html) function. Save the result in a dataframe `dist`.
##### Step 6
Using pandas functions, find the vacancy with the largest number of vacancies similar to it. Hint: the vector of cosine distances for that vacancy should have the most elements with values below `0.5`.
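
One possible way to apply the hint, sketched on a made-up 4×4 distance matrix: compare the whole dataframe with the threshold, sum the resulting booleans per column, and take the index of the largest count. The names `toy_dist`, `similar_counts`, and `best` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical symmetric distance matrix for four documents
toy_dist = pd.DataFrame([
    [0.0, 0.2, 0.9, 0.4],
    [0.2, 0.0, 0.8, 0.7],
    [0.9, 0.8, 0.0, 0.6],
    [0.4, 0.7, 0.6, 0.0],
])

# For each column, count the distances strictly below the threshold.
# The zero distance of a document to itself is included in the count.
similar_counts = (toy_dist < 0.5).sum(axis=0)

# Label of the document with the most close neighbours
best = similar_counts.idxmax()

print(similar_counts.tolist(), best)
```

Since every document is at distance 0 from itself, each count is inflated by one; this does not change which document wins, but subtract 1 if you need the number of other similar documents.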

In [1]:
import pandas as pd

#
# Your code here.
#
df = pd.read_csv('vacancies.csv')

In [60]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords
from nltk import RegexpTokenizer
import re
import pymorphy3


In [57]:
def preprocess(text):
    # Note: these three objects are rebuilt on every call, which is slow
    russian_stopwords = stopwords.words('russian')
    # gaps=True treats the pattern as a separator, i.e. splits the text on spaces
    tokenizer = RegexpTokenizer(' ', gaps=True)
    morph = pymorphy3.MorphAnalyzer()

    tst = tokenizer.tokenize(text)

    indexes = []
    split_words = []
    words = []

    # Lowercase every token; tokens that span a paragraph break are split further
    for i in range(len(tst)):
        tst[i] = tst[i].lower()
        parts = tst[i].split('\n\n')
        if len(parts) >= 2:
            indexes.append(i)
            split_words.extend(parts)

    # Drop the unsplit originals (from the end, so indices stay valid)
    # and append their parts instead
    for i in reversed(indexes):
        tst.pop(i)
    tst += split_words

    # Strip leftover line breaks and punctuation
    for i in range(len(tst)):
        for char in ('\n', ':', ',', '.', '•'):
            tst[i] = tst[i].replace(char, '')

    # Remove stop words
    for token in tst:
        if token not in russian_stopwords:
            words.append(token)

    # Lemmatize the remaining tokens
    tokens = [morph.parse(word)[0].normal_form for word in words]

    return ' '.join(tokens)
                      

In [61]:
vectorizer = TfidfVectorizer(
    preprocessor=preprocess,
    norm=None
)

In [63]:
tfidf_matrix = vectorizer.fit_transform(df.text)

result = pd.DataFrame(
    data=tfidf_matrix.toarray(),
    columns=vectorizer.get_feature_names_out()
)


In [67]:
result

Unnamed: 0,000,06jjq6ru3cyxcp,0a0a1tvvgd6qw,0h4wwufcxmp3u,0ihk9s7cecxzn,0l4_0nwk3uwaku,0lrks90zuaid4,10,100,1000,...,яндекство,яндекстелемост,яндекстолока,яндекстолочь,яндексуслуга,яндексучебник,яндексфлоу,яндексэфир,яп,ящик
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
620,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
621,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
622,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
623,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [71]:
from sklearn.metrics.pairwise import cosine_distances

# Pairwise cosine distances between all document vectors
distances = pd.DataFrame(cosine_distances(result))

In [72]:
distances

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,615,616,617,618,619,620,621,622,623,624
0,0.000000,0.857083,0.885023,0.890016,0.880650,0.924460,0.857513,0.005409,0.934043,0.739436,...,0.963600,0.961049,0.945595,0.932850,0.956056,0.843370,0.894783,0.947295,0.942791,0.942791
1,0.857083,0.000000,0.848584,0.878319,0.927624,0.917723,0.872263,0.856306,0.882827,0.946203,...,0.930580,0.932270,0.942822,0.968280,0.941157,0.864988,0.928965,0.921390,0.942020,0.942020
2,0.885023,0.848584,0.000000,0.787945,0.953686,0.895959,0.914341,0.903823,0.922369,0.955808,...,0.855249,0.751527,0.915491,0.809305,0.955590,0.943629,0.941501,0.948747,0.917369,0.917369
3,0.890016,0.878319,0.787945,0.000000,0.949716,0.926521,0.917099,0.889418,0.958250,0.954829,...,0.881056,0.922370,0.920190,0.973997,0.948735,0.935549,0.950759,0.965267,0.972001,0.972001
4,0.880650,0.927624,0.953686,0.949716,0.000000,0.911921,0.918140,0.880001,0.935451,0.928548,...,0.971629,0.974731,0.962848,0.932927,0.965220,0.891436,0.912321,0.964317,0.973662,0.973662
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
620,0.843370,0.864988,0.943629,0.935549,0.891436,0.921456,0.881028,0.842518,0.929075,0.897872,...,0.955930,0.935507,0.918817,0.661758,0.959309,0.000000,0.908034,0.896373,0.930284,0.930284
621,0.894783,0.928965,0.941501,0.950759,0.912321,0.878300,0.872594,0.894211,0.934551,0.915428,...,0.914176,0.875050,0.940801,0.853032,0.897170,0.908034,0.000000,0.924590,0.911905,0.911905
622,0.947295,0.921390,0.948747,0.965267,0.964317,0.927723,0.931970,0.947009,0.969154,0.953151,...,0.954236,0.945505,0.936159,0.914061,0.943127,0.896373,0.924590,0.000000,0.334823,0.334823
623,0.942791,0.942020,0.917369,0.972001,0.973662,0.953692,0.943847,0.948382,0.964834,0.961815,...,0.952617,0.913208,0.941777,0.832929,0.968183,0.930284,0.911905,0.334823,0.000000,0.000000


In [100]:
# Drop the helper column left over from a previous run, if it exists
distances = distances.drop(columns='similar_vacs', errors='ignore')

In [101]:
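# For every vacancy, count the distances strictly below the 0.5 threshold
# (the zero distance of a vacancy to itself is included in the count)
similar_vacs = []
for col in distances.columns:
    similar_vacs.append(len(distances[distances[col] < 0.5]))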
    

In [102]:
distances['similar_vacs'] = similar_vacs

In [103]:
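# Select the vacancies with the highest count instead of hard-coding the value 8
answers = distances[distances.similar_vacs == distances.similar_vacs.max()].index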

In [104]:
df.loc[answers]

Unnamed: 0,text
143,Разработчик-аналитик машинного обучения в Еду🍎...
283,Разработчик-аналитик ML в Лавку🍋\n\nЯндекс.Лав...


## Topic 5. Solution


In [None]:
import pandas as pd

### STEP 1
df = pd.read_csv('data/skillbox/ml_jun18/vacancies.csv')

df.head()

Unnamed: 0,text
0,Старший Java-разработчик в Музыку🎧\n\nВас ждет...
1,Python-разработчик в Яндекс.Лавку🍔\n\nЯндекс.Л...
2,Фронтенд-разработчик в Вертикали🏠\n\nВертикали...
3,iOS-разработчик в Вертикали (Буткемп)🍏\n\nВерт...
4,Старший разработчик в группу разработки бессер...


In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
import pymorphy2

tokenizer = RegexpTokenizer(r'\w+')  # raw string avoids an invalid-escape warning
russian_stopwords = stopwords.words('russian')
morph = pymorphy2.MorphAnalyzer()

### STEP 2
def preprocess(text):
    lemmas = []
    for word in tokenizer.tokenize(text):
        word = word.lower()
        if word not in russian_stopwords:
            # Take the normal form of the most probable parse
            lemmas.append(morph.parse(word)[0].normal_form)
    return ' '.join(lemmas)

### STEP 3
vectorizer = TfidfVectorizer(
    preprocessor=preprocess,
    norm=None
)

### STEP 4
tfidf_matrix = vectorizer.fit_transform(df['text'])

result = pd.DataFrame(
    data=tfidf_matrix.toarray(),
    columns=vectorizer.get_feature_names_out()
)

result

Unnamed: 0,000,06jjq6ru3cyxcp,0a0a1tvvgd6qw,0h4wwufcxmp3u,0ihk9s7cecxzn,0l4_0nwk3uwaku,0lrks90zuaid4,10,100,1000,...,юридический,явление,являться,ядро,язык,языковой,яндекс,яндекс360,яп,ящик
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,2.635329,0.0,2.741850,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,2.741850,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,1.370925,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
620,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,2.635329,0.0,1.370925,0.0,0.0,0.0
621,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,1.370925,0.0,0.0,0.0
622,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,1.370925,0.0,0.0,0.0
623,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,1.370925,0.0,0.0,0.0


In [None]:
from sklearn.metrics.pairwise import cosine_distances

### STEP 5
distances = cosine_distances(result)

dist = pd.DataFrame(distances)

dist.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,615,616,617,618,619,620,621,622,623,624
0,0.0,0.867277,0.892791,0.898116,0.890252,0.924523,0.863116,0.004905,0.937734,0.744903,...,0.969723,0.962532,0.948985,0.937083,0.958178,0.85299,0.901102,0.950424,0.940666,0.940666
1,0.867277,0.0,0.850028,0.883378,0.932432,0.91026,0.872761,0.866623,0.851844,0.941676,...,0.942066,0.930739,0.945752,0.96684,0.940882,0.868696,0.930068,0.851514,0.873019,0.873019
2,0.892791,0.850028,0.0,0.705472,0.956243,0.895629,0.910963,0.910157,0.919442,0.954458,...,0.782189,0.694877,0.892991,0.798895,0.957063,0.945965,0.944117,0.950632,0.916042,0.916042
3,0.898116,0.883378,0.705472,0.0,0.953029,0.922851,0.918541,0.897613,0.956812,0.953168,...,0.784778,0.886813,0.906069,0.974369,0.950082,0.937717,0.952347,0.966035,0.972255,0.972255
4,0.890252,0.932432,0.956243,0.953029,0.0,0.911678,0.921108,0.889711,0.939244,0.929367,...,0.976503,0.975582,0.965005,0.937016,0.966852,0.897681,0.917412,0.966232,0.971561,0.971561


In [None]:
### STEP 6
dist.apply(lambda x: x[x < 0.5].count()).idxmax()

df.iloc[143]

text    Разработчик-аналитик машинного обучения в Еду🍎...
Name: 143, dtype: object

In [None]:
### Extra: all similar vacancies.
dist.iloc[143][dist.iloc[143]<0.5]

143    0.000000
221    0.197556
283    0.371167
293    0.415468
359    0.427564
365    0.404273
485    0.414981
486    0.414981
Name: 143, dtype: float64

In [None]:
### For example
df.iloc[221]

text    Аналитик ML в Еду🍇\n\nЯндекс.Еда — быстро раст...
Name: 221, dtype: object