For this assignment, use the [transformers](https://pypi.org/project/transformers/) library.

## Tasks:
  1. Obtain vector representations of words with transformers
  2. Train a model to classify vacancy texts by role (professional_roles)
  3. Train a model to predict the salary (lower and upper bounds, or the average value)


## Materials:
  - https://github.com/huggingface/transformers/tree/master/notebooks
  - https://huggingface.co/models?language=ru&sort=downloads

# Task 1

In [1]:
!pip install transformers



In [2]:
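from transformers import AutoTokenizer, AutoModelForMaskedLM
import pandas as pd
import numpy as np
import torch

def mean_pooling(model_output, attention_mask):
    """Average the per-token vectors, ignoring padded positions."""
    # NOTE: with AutoModelForMaskedLM, model_output[0] holds vocabulary logits
    # (one score per vocabulary entry), not hidden-state token embeddings.
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)  # avoid division by zero
    return sum_embeddings / sum_mask

tokenizer = AutoTokenizer.from_pretrained("sberbank-ai/sbert_large_nlu_ru")
# Loading with AutoModel instead would return the encoder's hidden states.
# The masked-LM head used here is newly initialized (see the warning printed below).
model = AutoModelForMaskedLM.from_pretrained("sberbank-ai/sbert_large_nlu_ru")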

Some weights of BertForMaskedLM were not initialized from the model checkpoint at sberbank-ai/sbert_large_nlu_ru and are newly initialized: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
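
To make the pooling step concrete, the following sketch reproduces the masked average from `mean_pooling` on a tiny hand-made batch. It uses NumPy rather than PyTorch purely for brevity; the array values and the name `masked_mean` are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def masked_mean(token_vectors, attention_mask):
    """Average token vectors over the sequence axis, skipping padded positions."""
    mask = attention_mask[:, :, None].astype(float)   # shape (batch, seq, 1)
    summed = (token_vectors * mask).sum(axis=1)       # shape (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)    # shape (batch, 1), avoids 0/0
    return summed / counts

# One sequence with three positions; the last position is padding.
token_vectors = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
attention_mask = np.array([[1, 1, 0]])

pooled = masked_mean(token_vectors, attention_mask)
print(pooled)  # the padded vector [100, 100] does not affect the average
```

Only the two unmasked vectors contribute, so the result is their element-wise mean.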


In [3]:
df = pd.read_csv("vacancies.csv", index_col=0)
df

Unnamed: 0,vacancy_id,raw_json,prepared_description
0,77594629,"{'id': '77594629', 'premium': False, 'billing_...","[['обязанность', 'работа', 'архивом', 'закрыты..."
1,43551857,"{'id': '43551857', 'premium': False, 'billing_...","[['обязанность', 'работа', 'производственный',..."
2,8388609,"{'id': '8388609', 'premium': False, 'billing_t...","[['менеджер', 'counter', 'manager', 'бренд', '..."
3,10485773,"{'id': '10485773', 'premium': False, 'billing_...","[['обязанность', 'кредитование', 'юридический'..."
4,10485774,"{'id': '10485774', 'premium': False, 'billing_...","[['задача', 'организация', 'деятельность', 'уп..."
...,...,...,...
10066,71344604,"{'id': '71344604', 'premium': False, 'billing_...","[['обязанность', 'участие', 'трансформация', '..."
10067,73441763,"{'id': '73441763', 'premium': False, 'billing_...","[['фирменный', 'магазин', 'redmond', 'пойменны..."
10068,54567399,"{'id': '54567399', 'premium': False, 'billing_...","[['обязанность', 'руководить', 'работа', 'соде..."
10069,29401585,"{'id': '29401585', 'premium': False, 'billing_...","[['обслуживание', 'ремонт', 'электрооборудован..."


In [4]:
df1 = df.sample(frac=1, random_state=42, ignore_index=True)[:100]
df1

Unnamed: 0,vacancy_id,raw_json,prepared_description
0,33559176,"{'id': '33559176', 'premium': False, 'billing_...","[['присоединяться', 'крупный', 'европа', 'онла..."
1,48235903,"{'id': '48235903', 'premium': False, 'billing_...","[['международный', 'франчайзинговый', 'сеть', ..."
2,6325648,"{'id': '6325648', 'premium': False, 'billing_t...","[['обязанность', 'активный', 'привлечение', 'к..."
3,77599446,"{'id': '77599446', 'premium': False, 'billing_...","[['обязанность', 'разработка', 'стилевой', 'ре..."
4,48234655,"{'id': '48234655', 'premium': False, 'billing_...","[['обязанность', 'проведение', 'погрузочно', '..."
...,...,...,...
95,48252525,"{'id': '48252525', 'premium': False, 'billing_...","[['отель', 'кемпински', 'мойка', 'открыть', 'в..."
96,4235466,"{'id': '4235466', 'premium': False, 'billing_t...","[['секретарь', 'знание', 'английский', 'язык',..."
97,29400642,"{'id': '29400642', 'premium': False, 'billing_...","[['обязанность', 'организовывать', 'работа', '..."
98,2116743,"{'id': '2116743', 'premium': False, 'billing_t...","[['обязанность', 'перевозка', 'экспедирование'..."


In [5]:
import ast  # literal_eval parses the stored list literals without executing code
X = df1['prepared_description'].apply(ast.literal_eval).apply(np.concatenate).apply(' '.join)
X

0     присоединяться крупный европа онлайн школе анг...
1     международный франчайзинговый сеть центр разви...
2     обязанность активный привлечение клиент точка ...
3     обязанность разработка стилевой решение сериал...
4     обязанность проведение погрузочно разгрузочный...
                            ...                        
95    отель кемпински мойка открыть вакансия младший...
96    секретарь знание английский язык обязанность п...
97    обязанность организовывать работа сектор анали...
98    обязанность перевозка экспедирование готовый п...
99    обязанность токарный обработка деталь класс то...
Name: prepared_description, Length: 100, dtype: object

In [6]:
import ast  # literal_eval parses the stored dict literal without executing code
roles = df1['raw_json'].apply(ast.literal_eval).apply(lambda x: x['professional_roles'][0]['name'])
roles

0                                                Другое
1                                     Воспитатель, няня
2     Менеджер по продажам, менеджер по работе с кли...
3                                    Дизайнер, художник
4                                                Другое
                            ...                        
95                      Специалист по подбору персонала
96                                           Переводчик
97                                               Другое
98                                             Водитель
99                       Токарь, фрезеровщик, шлифовщик
Name: raw_json, Length: 100, dtype: object

In [7]:
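def count_salary(vac):
    """Return one salary figure per vacancy: the midpoint when both bounds exist."""
    if vac['salary'] is None:
        return np.nan  # no salary published
    if vac['salary']['to'] is None:
        return vac['salary']['from']  # only the lower bound is known
    if vac['salary']['from'] is None:
        return vac['salary']['to']  # only the upper bound is known

    return (vac['salary']['from'] + vac['salary']['to']) // 2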
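
The branches of `count_salary` can be checked on a few hand-written records. This is a minimal sketch that restates the same logic with the standard library only; the function name `salary_midpoint` and the example dictionaries are hypothetical and contain just the fields the function reads.

```python
import math

def salary_midpoint(vacancy):
    """Illustrative restatement of count_salary without NumPy."""
    salary = vacancy["salary"]
    if salary is None:
        return math.nan          # no salary published
    low, high = salary["from"], salary["to"]
    if high is None:
        return low               # only the lower bound is known
    if low is None:
        return high              # only the upper bound is known
    return (low + high) // 2     # integer midpoint of the range

examples = [
    {"salary": None},
    {"salary": {"from": 40000, "to": None}},
    {"salary": {"from": None, "to": 60000}},
    {"salary": {"from": 40000, "to": 60001}},
]
results = [salary_midpoint(v) for v in examples]
print(results)
```

Floor division rounds an odd sum down, so the last record yields 50000 rather than 50000.5.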

In [8]:
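import ast  # literal_eval parses the stored dict literal without executing code
salary = df1['raw_json'].apply(ast.literal_eval).apply(count_salary)
# Vacancies without a salary get the rounded sample mean as their target.
salary = salary.fillna(np.round(salary.mean()))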
salary

0     49926.0
1     95000.0
2     42600.0
3     49926.0
4     27000.0
       ...   
95    45000.0
96    27500.0
97    49926.0
98    32500.0
99    49926.0
Name: raw_json, Length: 100, dtype: float64
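
Filling missing salaries with the sample mean (49926 in the output above) gives every vacancy without a salary the same target value, which can pull a regressor toward the mean. One alternative, sketched here on made-up numbers, is to keep a boolean mask of the rows that have a salary and train the salary model only on those. All arrays and names below are hypothetical.

```python
import numpy as np

# Toy targets: NaN marks a vacancy without a published salary.
salary_raw = np.array([np.nan, 95000.0, 42600.0, np.nan, 27000.0])
# Toy feature matrix with one row per vacancy.
features = np.arange(10, dtype=float).reshape(5, 2)

has_salary = ~np.isnan(salary_raw)        # True where a target exists
features_labeled = features[has_salary]   # rows usable for regression
salary_labeled = salary_raw[has_salary]

print(features_labeled.shape, salary_labeled)
```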

In [9]:
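def text2vec(text):
    # Truncate to the model's 512-position limit so long descriptions do not raise an error.
    encoded_input = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad():  # inference only, no gradients needed
        model_output = model(**encoded_input)
    return mean_pooling(model_output, encoded_input['attention_mask'])[0]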

In [10]:
print(text2vec('hello world'))
print(text2vec('hello world').numpy())

tensor([ 0.2501, -0.8387, -0.1078,  ..., -0.6240, -0.4580, -1.4798])
[ 0.25014663 -0.838737   -0.10777612 ... -0.62403053 -0.45800465
 -1.4797916 ]


In [11]:
df2 = pd.DataFrame([text2vec(text).numpy() for text in X])
df2

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,120128,120129,120130,120131,120132,120133,120134,120135,120136,120137
0,0.170708,-1.280098,-1.154163,-0.554981,-0.343960,-0.920678,-0.896522,-0.711005,-0.591922,-0.225489,...,-0.961664,-0.866776,-0.704550,-0.708741,0.044820,1.925745,0.601842,-1.436549,-1.038092,-0.665544
1,0.578011,-1.392399,-0.323652,-0.131905,0.093156,-1.084054,-0.315270,-0.639440,-0.722718,-0.278521,...,-0.290188,-0.680886,0.368326,-1.429979,0.377083,1.419051,-0.359759,-1.069397,-1.386881,-0.346862
2,0.978201,-1.006356,-0.435878,-0.089034,-0.455157,-0.845324,-0.865259,-1.034538,-0.936644,0.092933,...,-0.651725,-0.085324,-0.147932,-1.324426,0.863451,1.923470,0.612100,-1.263572,-1.612868,0.142344
3,0.317979,-0.648868,-1.321728,0.434413,-0.223795,-0.988766,0.064955,-0.928809,-0.770875,-0.016484,...,-0.500143,-1.094956,-0.533764,-1.318459,0.489486,1.499255,0.468764,-1.459855,-1.386119,-0.336290
4,1.026274,0.191777,-1.131077,0.638171,-0.350284,-0.367409,-0.016229,-0.539119,-0.769166,0.227206,...,0.432567,-0.368525,-0.177737,-1.225340,0.132909,1.833604,-0.385177,-0.468157,-1.011916,0.436092
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0.489686,-1.260512,-0.592443,-0.553109,-0.579180,-0.822251,-0.935030,-1.143974,-1.039525,-0.210864,...,-0.130117,-0.042906,-0.060662,-0.865449,1.332569,1.883497,-0.089781,-1.047070,-1.291353,-0.560146
96,0.412622,-0.884970,-0.703086,-0.271222,-0.247958,-0.639670,-0.325552,-0.650229,-0.845612,0.497393,...,0.342139,-0.385492,0.055042,-0.849230,0.652808,1.737687,0.165650,-1.014284,-1.623425,-0.778147
97,1.379781,-0.175273,-0.556674,0.292000,-0.417891,-0.890286,-0.943647,-0.110682,-0.366390,-0.538272,...,-0.501976,0.377371,0.522915,-0.210290,0.238386,1.613550,-0.135487,-0.225165,-0.852702,-0.268437
98,1.316306,0.332249,-0.925316,0.626877,-0.679694,-0.518702,-0.338710,-0.794350,-0.692470,0.386381,...,-0.427988,-0.105199,-0.186854,-1.060440,0.691401,1.103820,0.398516,-0.655991,-0.570803,-0.197239


# Task 2

In [12]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder

In [13]:
le = LabelEncoder()
roles1 = le.fit_transform(roles)

In [14]:
def print_classification_metrics(y_test, y_pred):
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))

In [15]:
X_train, X_test, y_train, y_test = train_test_split(df2, roles1, test_size=0.1)

In [16]:
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)

In [17]:
print_classification_metrics(y_test, dtc.predict(X_test))
print("DTC score:", dtc.score(X_test, y_test))

[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 1 0 0 0 1 1 0 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]]
              precision    recall  f1-score   support

           1       0.00      0.00      0.00       0.0
           3       0.00      0.00      0.00       1.0
           5       0.00      0.00      0.00       0.0
           7       0.00      0.00      0.00       0.0
           9       0.00      0.00      0.00       3.0
          11       0.00      0.00      0.00       1.0
          17       0.00      0.00      0.00       1.0
          18       0.00      0.00      0.00       1.0
          22     



DTC score: 0.0
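
A score of 0.0 is plausible in this setup: 100 vacancies are spread over many roles, so a role seen in the 10-row test split may never occur in training. One way to inspect and reduce this, sketched below with the standard library, is to count the labels and keep only samples whose role occurs at least `min_count` times. The helper `keep_frequent` and the toy label list are assumptions for illustration.

```python
from collections import Counter

def keep_frequent(labels, min_count=2):
    """Return indices of samples whose label occurs at least min_count times."""
    counts = Counter(labels)
    return [i for i, label in enumerate(labels) if counts[label] >= min_count]

# Toy labels reusing role names from the sample above.
labels = ["Другое", "Водитель", "Другое", "Переводчик", "Водитель", "Другое"]

kept = keep_frequent(labels)
print(kept)                                  # positions of samples to keep
print(Counter(labels[i] for i in kept))      # remaining class sizes
```

The kept indices can be used to subset both the feature frame and the labels. `train_test_split` also accepts a `stratify` argument, which needs at least two samples per class, so a filter like this would have to run first.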


In [18]:
from sklearn.naive_bayes import GaussianNB

In [19]:
nb = GaussianNB()
nb.fit(X_train, y_train)

In [20]:
print_classification_metrics(y_test, nb.predict(X_test))
print("Naive bayes score:", nb.score(X_test, y_test))

[[0 0 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0]
 [0 0 3 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 0]]
              precision    recall  f1-score   support

           2       0.00      0.00      0.00         0
           3       0.00      0.00      0.00         1
           9       0.33      1.00      0.50         3
          11       0.00      0.00      0.00         1
          17       0.00      0.00      0.00         1
          18       0.00      0.00      0.00         1
          33       0.00      0.00      0.00         1
          50       0.00      0.00      0.00         1
          53       0.00      0.00      0.00         1

    accuracy                           0.30        10
   macro avg       0.04      0.11      0.06        10
weighted avg       0.10      0.30      0.15        10





Naive bayes score: 0.3


# Task 3

In [21]:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

In [22]:
def print_regression_metrics(y_test, y_pred):
    mse = mean_squared_error(y_test, y_pred)
    print("R2:", r2_score(y_test, y_pred))
    print("MSE:", mse)
    # np.sqrt works across scikit-learn versions; the `squared` argument is deprecated in recent releases.
    print("RMSE:", np.sqrt(mse))
    print("MAE:", mean_absolute_error(y_test, y_pred))

In [23]:
X_train, X_test, y_train, y_test = train_test_split(df2, salary, test_size=0.1)

In [24]:
lr = LinearRegression()
lr.fit(X_train, y_train)

In [25]:
print_regression_metrics(y_test, lr.predict(X_test))

R2: -0.2073840794142765
MSE: 3206713938.9509645
RMSE: 56627.85479736068
MAE: 32354.32578125
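
A negative R2 means the model's squared error on the test set is larger than that of a constant prediction equal to the test-set mean. The worked example below computes the same four metrics by hand on three made-up salaries to show where zero and negative values come from. The function name and the numbers are illustrative assumptions.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute R2, MSE, RMSE and MAE from their definitions."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    mse = ss_res / n
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return {"r2": 1 - ss_res / ss_tot, "mse": mse, "rmse": math.sqrt(mse), "mae": mae}

y_true = [30000.0, 50000.0, 70000.0]

# Predicting the mean everywhere gives R2 = 0 by construction.
baseline = regression_metrics(y_true, [50000.0] * 3)
# Predictions that reverse the ordering are worse than the mean, so R2 < 0.
worse = regression_metrics(y_true, [70000.0, 50000.0, 30000.0])

print(baseline["r2"], worse["r2"])
```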


In [26]:
dtr = DecisionTreeRegressor()
dtr.fit(X_train, y_train)

In [27]:
print_regression_metrics(y_test, dtr.predict(X_test))

R2: -0.7783280660839247
MSE: 4723094742.4
RMSE: 68724.77531720274
MAE: 46446.4
