# Distributed Training of Classical Models

Let's talk about how to solve a machine learning problem when there are so many observations that they do not all fit on one machine.

The central idea in all of these algorithms is the same: compute, in parallel on several machines, the partial results needed to make a decision, send them to a central machine, and take a step of the algorithm there.

To train linear models, each machine computes the gradient on its own part of the data, and the main machine takes the gradient descent step.
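
As a rough illustration of this scheme, here is a single-process sketch in which "workers" are just slices of a NumPy array. The function names (`local_gradient`, `distributed_gd`), the squared-error loss, and the synthetic data are assumptions made for the example, not part of VW.

```python
import numpy as np


def local_gradient(w, X, y):
    """Worker side: sum of squared-error gradients over one shard, plus its row count."""
    residual = X @ w - y
    return X.T @ residual, len(y)


def distributed_gd(shards, n_features, lr=0.1, n_steps=200):
    """Simulate data-parallel gradient descent over a list of (X, y) shards."""
    w = np.zeros(n_features)
    for _ in range(n_steps):
        # "Workers": each shard contributes only a gradient sum and a row count.
        partials = [local_gradient(w, X, y) for X, y in shards]
        # "Main machine": combine the partial results and take one step.
        grad = sum(g for g, _ in partials)
        n_rows = sum(n for _, n in partials)
        w -= lr * grad / n_rows
    return w


# Synthetic noiseless linear data, split between two simulated workers.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w
shards = [(X[:500], y[:500]), (X[500:], y[500:])]
w_hat = distributed_gd(shards, n_features=3)
print(w_hat)
```

Only the gradient sums travel between workers and the main machine, so the raw rows never have to leave the shard that stores them.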

For decision trees, each machine computes the distribution of its data over bins, and the main machine determines the threshold for a given feature.
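
A minimal sketch of the histogram idea for a single numeric feature and a regression target. The fixed bin edges, the helper names, and the variance-reduction criterion are assumptions chosen for illustration; real libraries differ in the details.

```python
import numpy as np

# 10 equal-width bins; assumed to be agreed on by all workers in advance.
BIN_EDGES = np.linspace(0.0, 1.0, 11)


def local_histogram(x, y):
    """Worker side: per-bin row counts and per-bin target sums for one shard."""
    n_bins = len(BIN_EDGES) - 1
    bins = np.clip(np.digitize(x, BIN_EDGES) - 1, 0, n_bins - 1)
    counts = np.bincount(bins, minlength=n_bins)
    sums = np.bincount(bins, weights=y, minlength=n_bins)
    return counts, sums


def best_threshold(counts, sums):
    """Main machine: pick the bin edge that minimises squared error after the split."""
    total_n, total_s = counts.sum(), sums.sum()
    best_score, best_edge = -np.inf, None
    left_n = left_s = 0.0
    for i in range(len(counts) - 1):
        left_n += counts[i]
        left_s += sums[i]
        right_n, right_s = total_n - left_n, total_s - left_s
        if left_n == 0 or right_n == 0:
            continue
        # Maximising this score is equivalent to minimising the split's squared error.
        score = left_s ** 2 / left_n + right_s ** 2 / right_n
        if score > best_score:
            best_score, best_edge = score, BIN_EDGES[i + 1]
    return best_edge


rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=1000)
y = (x >= BIN_EDGES[4]).astype(float)  # ground truth: a step at 0.4

# Two simulated workers, each sees half of the rows.
c1, s1 = local_histogram(x[:500], y[:500])
c2, s2 = local_histogram(x[500:], y[500:])
# Histograms are additive, so the main machine just sums them.
threshold = best_threshold(c1 + c2, s1 + s2)
print(threshold)
```

Because histograms add up, the merged histogram is identical to the one a single machine would build over all rows, while each worker sends only a few numbers per bin.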

### Distributed Training with VW

Vowpal Wabbit can also work in distributed mode, which makes it a universal tool for training linear models on big data. For this it uses an additional component, `spanning_tree`: a special process that coordinates the workers with each other.

You can also think of it as the root vertex in the "Tree Allreduce" algorithm, which is used to utilize the network efficiently during training.
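
To make the allreduce idea concrete, here is a toy single-process simulation of a sum-allreduce over a binary tree. It is not VW's networking code; the heap-style numbering (the parent of node `i` is `(i - 1) // 2`) and the function name are assumptions for illustration.

```python
def tree_allreduce(values):
    """Simulate a sum-allreduce over nodes arranged as a binary tree (node 0 is the root)."""
    n = len(values)
    partial = list(values)
    # Reduce phase: go from the leaves up; each node adds its subtree sum to its parent.
    # Children always have larger indices than their parent, so descending order works.
    for node in range(n - 1, 0, -1):
        parent = (node - 1) // 2
        partial[parent] = partial[parent] + partial[node]
    # Broadcast phase: the root now holds the global sum and pushes it back down.
    result = [None] * n
    result[0] = partial[0]
    for node in range(1, n):
        parent = (node - 1) // 2
        result[node] = result[parent]
    return result


totals = tree_allreduce([1.0, 2.0, 3.0, 4.0, 5.0])
print(totals)  # every node ends up with the same total: 15.0
```

Each node talks only to its parent and its children, so no single machine has to receive messages from every worker.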

To be able to use `spanning_tree`, you need to build VW by hand.

Let's build VW. This has to be done as the superuser, so it is most convenient to run it from a terminal.

```bash
apt update && \
apt install git psmisc -y && \
apt install libboost-dev libboost-program-options-dev libboost-system-dev libboost-thread-dev libboost-math-dev libboost-test-dev zlib1g-dev cmake g++ -y 


wget https://github.com/google/flatbuffers/archive/v1.12.0.tar.gz && \
tar -xzf v1.12.0.tar.gz && \
cd flatbuffers-1.12.0 && \
mkdir build_dir && \
cd build_dir && \
cmake -G "Unix Makefiles" -DFLATBUFFERS_BUILD_TESTS=Off -DFLATBUFFERS_INSTALL=On -DCMAKE_BUILD_TYPE=Release -DFLATBUFFERS_BUILD_FLATHASH=Off .. && \
make install -j$(nproc) && \
cd ../..

git clone --recursive https://github.com/VowpalWabbit/vowpal_wabbit.git && \
cd vowpal_wabbit && \
sudo make && \
cd build && \
sudo make install -j$(nproc)
```

**Handy tip:** to get root access on the Azure cluster through Jupyter, open a terminal and connect over ssh as the user `azureuser`. The current user, `spark`, unfortunately has very few permissions.

```bash
ssh azureuser@localhost
sudo su
```

In [40]:
%%writefile install_vw.sh

sudo apt update -y
sudo apt install git psmisc -y 
sudo apt install libboost-dev libboost-program-options-dev libboost-system-dev libboost-thread-dev libboost-math-dev libboost-test-dev zlib1g-dev cmake g++ -y 

wget https://github.com/google/flatbuffers/archive/v1.12.0.tar.gz && \
    tar -xzf v1.12.0.tar.gz && \
    cd flatbuffers-1.12.0 && \
    mkdir build_dir && \
    cd build_dir && \
    cmake -G "Unix Makefiles" -DFLATBUFFERS_BUILD_TESTS=Off -DFLATBUFFERS_INSTALL=On -DCMAKE_BUILD_TYPE=Release -DFLATBUFFERS_BUILD_FLATHASH=Off .. && \
    make install -j$(nproc) && \
    cd ../..
    
git clone --recursive https://github.com/VowpalWabbit/vowpal_wabbit.git && \
    cd vowpal_wabbit && \
    git checkout d1ead9a0a9afd56d2ee11a72e0c1aaa7702ee281 && \
    sudo make && \
    cd build && \
    sudo make install -j$(nproc)

Overwriting install_vw.sh


In [None]:
! bash install_vw.sh

In [26]:
! sudo rm -r vowpal_wabbit/

In [1]:
! which vw

/usr/local/bin/vw


In [2]:
! which spanning_tree

/usr/local/bin/spanning_tree


In [7]:
! wget https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip
! unzip drugsCom_raw.zip

--2022-03-09 15:19:30--  https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42989872 (41M) [application/x-httpd-php]
Saving to: ‘drugsCom_raw.zip’


2022-03-09 15:33:35 (49.7 KB/s) - ‘drugsCom_raw.zip’ saved [42989872/42989872]

Archive:  drugsCom_raw.zip
  inflating: drugsComTest_raw.tsv    
  inflating: drugsComTrain_raw.tsv   


In [3]:
! hdfs dfs -ls /user

Found 6 items
drwxr-xr-x   - ubuntu hadoop          0 2022-02-14 13:31 /user/airbnb
drwxr-xr-x   - ubuntu hadoop          0 2022-03-09 18:31 /user/drugs
drwxr-xr-x   - hive   hadoop          0 2022-01-30 14:30 /user/hive
drwxr-xr-x   - ubuntu hadoop          0 2022-02-07 13:32 /user/spark-example
drwxr-xr-x   - ubuntu hadoop          0 2022-02-05 09:25 /user/tweets
drwxr-xr-x   - ubuntu hadoop          0 2022-02-05 08:15 /user/ubuntu


In [12]:
! hdfs dfs -rm -r /user/drugs/data || true
! hdfs dfs -mkdir -p /user/drugs/data

Deleted /user/drugs/data


In [4]:
! hdfs dfs -ls /user/drugs

Found 4 items
drwxr-xr-x   - ubuntu hadoop          0 2022-03-09 16:14 /user/drugs/data
drwxr-xr-x   - ubuntu hadoop          0 2022-03-09 18:31 /user/drugs/part1.vw
drwxr-xr-x   - ubuntu hadoop          0 2022-03-09 18:31 /user/drugs/part2.vw
drwxr-xr-x   - ubuntu hadoop          0 2022-03-09 18:31 /user/drugs/test.vw


Let's upload the drug reviews dataset to HDFS.

In [13]:
%%bash

cat drugsComTrain_raw.tsv <(tail -n +2 drugsComTest_raw.tsv) | hdfs dfs -put - /user/drugs/data/drugs.tsv

In [14]:
! hdfs dfs -ls  -h /user/drugs/data

Found 1 items
-rw-r--r--   1 ubuntu hadoop    107.2 M 2022-03-09 16:14 /user/drugs/data/drugs.tsv


In [5]:
import findspark
findspark.init()

In [6]:
import pyspark
sc = pyspark.SparkContext(appName="lsml-app-1")

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2022-03-10 06:49:32,114 WARN util.Utils: spark.executor.instances less than spark.dynamicAllocation.minExecutors is invalid, ignoring its setting, please update your configs.
2022-03-10 06:49:45,779 WARN util.Utils: spark.executor.instances less than spark.dynamicAllocation.minExecutors is invalid, ignoring its setting, please update your configs.
2022-03-10 06:49:45,794 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpo

In [7]:
from pyspark.sql import SparkSession, Row
se = SparkSession(sc)

In [8]:
from pyspark.sql import functions as F
from datetime import datetime
import re

In [9]:
data = se.read.option("delimiter", "\t").csv('/user/drugs/data/*', header=True, inferSchema=True)

                                                                                

In [10]:
data.show()

+--------------------+--------------------+--------------------+--------------------+------+-----------------+-----------+
|                 _c0|            drugName|           condition|              review|rating|             date|usefulCount|
+--------------------+--------------------+--------------------+--------------------+------+-----------------+-----------+
|              206461|           Valsartan|Left Ventricular ...|"""It has no side...|   9.0|     May 20, 2012|         27|
|               95260|          Guanfacine|                ADHD|"""My son is half...|  null|             null|       null|
|We have tried man...|                 8.0|      April 27, 2010|                 192|  null|             null|       null|
|               92703|              Lybrel|       Birth Control|"""I used to take...|  null|             null|       null|
|The positive side...|                 5.0|   December 14, 2009|                  17|  null|             null|       null|
|              1

We will run 2 workers, so let's split the whole dataset into 3 parts: 2 equal parts for the workers and 1 small part for testing.

In [11]:
part1, part2, test = (
    data
    .na.drop('any')
    .randomSplit([0.45, 0.45, 0.1], 422)
)

Let's build the VW dataset with Spark.
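
For reference, each line produced by the conversion follows VW's text input format: a label followed by `|namespace` blocks of features. The sketch below splits such a line back into its parts. The helper name `parse_vw_line` is hypothetical, the example line is a shortened, hand-written version of the first dataset row, and the parser covers only this simple shape (no tags, importance weights, or empty namespaces).

```python
def parse_vw_line(line):
    """Split one simple VW-format line into its label and a {namespace: [tokens]} dict."""
    head, *blocks = line.split('|')
    label = float(head.strip())
    namespaces = {}
    for block in blocks:
        # The first token after '|' is the namespace name, the rest are its features.
        name, *tokens = block.split()
        namespaces[name] = tokens
    return label, namespaces


# Hand-written example line (review text shortened for readability).
example = "27 |d valsartan |c left_ventricular_dysfunction |r it has no side effect |w 6 |s rating:9.0"
label, namespaces = parse_vw_line(example)
print(label)
print(namespaces)
```

Single-letter namespaces are what make options such as `--ngram r2` and `--interactions dc` possible: they refer to namespaces by their first letter.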

In [12]:
def convert_to_vw(data):
    target = data['usefulCount']
    
    drug_name = data['drugName'].lower().replace(' ', '_')
    condition = data['condition'].lower().replace(' ', '_')
    
    raw_text = data['review'].lower()
    word_pattern = re.compile(r"[a-zA-Z0-9_]+")
    words = [match.group(0) for match in re.finditer(word_pattern, raw_text)]
    review = ' '.join(words)
    
    rating = data['rating']
    
    weekday = datetime.strptime(data['date'], '%B %d, %Y').weekday()
    
    template = "{target} |d {drug_name} |c {condition} |r {review} |w {weekday} |s rating:{rating}"
    return template.format(
        target=target,
        drug_name=drug_name,
        condition=condition,
        review=review,
        weekday=weekday,
        rating=rating
    )

In [13]:
! hdfs dfs -rm -r /user/drugs/*.vw

Deleted /user/drugs/part1.vw
Deleted /user/drugs/part2.vw
Deleted /user/drugs/test.vw


In [14]:
part1.rdd.map(convert_to_vw).saveAsTextFile('/user/drugs/part1.vw')
part2.rdd.map(convert_to_vw).saveAsTextFile('/user/drugs/part2.vw')
test.rdd.map(convert_to_vw).saveAsTextFile('/user/drugs/test.vw')

                                                                                

In [15]:

! hdfs dfs -cat /user/drugs/part1.vw/* > train.part1.vw
! hdfs dfs -cat /user/drugs/part2.vw/* > train.part2.vw
! hdfs dfs -cat /user/drugs/test.vw/* > test.vw

Let's see what results we get if we simply run VW on the whole file.

In [16]:
! cat train.*.vw > train.full.vw

In [17]:
import numpy as np
from sklearn.metrics import r2_score


def calc_r2(predictions_filename, answers_filename):
    def read_target_from_vw(vw_record):
        return float(vw_record.split(' ')[0])
    
    with open(predictions_filename, 'r') as f:
        y_pred = np.array([float(value) for value in f.readlines()])
        
    with open(answers_filename, 'r') as f:
        y_expected = np.array([read_target_from_vw(value) for value in f.readlines()])
        
    return r2_score(y_expected, y_pred)

In [18]:
! vw --help | head

Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = 
num sources = 1
driver:
  --onethread           Disable parse thread
VW options:
  --ring_size arg (=256, ) size of example ring
  --strict_parse           throw on malformed examples
Update options:
  -l [ --learning_rate ] arg Set learning rate
  --power_t arg              t power value
  --decay_learning_rate arg  Set Decay factor for learning_rate between passes
  --initial_t arg            initial t value


Train VW on the single full file.

In [19]:
%%time

! vw --final_regressor drugs.model.bin train.full.vw \
    --onethread \
    --learning_rate 20.0 \
    --bit_precision 23 \
    --passes 40 \
    --ngram r2 \
    --interactions dc \
    --cache -k

Generating 2-grams for r namespaces.
creating features for following interactions: dc 
final_regressor = drugs.model.bin
Num weight bits = 23
learning rate = 20
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
creating cache_file = train.full.vw.cache
Reading datafile = train.full.vw
num sources = 1
Enabled reductions: gd, scorer
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
16.000000 16.000000            1            1.0   4.0000   0.0000      155
8.284978 0.569956            2            2.0   1.0000   1.7550      207
5.384596 2.484213            4            4.0   0.0000   1.6859      249
3.444507 1.504419            8            8.0   1.0000   0.1637       45
110.780388 218.116268           16           16.0   3.0000   7.7153      139
75.235660 39.690933           32           32.0   2.0000   4.0961      151
51.536644 27.837627           64           64.0   2.0000   1.9416       7

In [20]:
! vw --testonly --initial_regressor drugs.model.bin --predictions drugs.preductions.txt test.vw

Generating 2-grams for r namespaces.
creating features for following interactions: dc 
only testing
predictions = drugs.preductions.txt
Num weight bits = 23
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = test.vw
num sources = 1
Enabled reductions: gd, scorer
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
0.447677 0.447677            1            1.0   2.0000   2.6691      317
2.223839 4.000000            2            2.0   2.0000   0.0000      237
4.300955 6.378071            4            4.0  10.0000  11.9381      163
25.360267 46.419579            8            8.0   1.0000  14.3987       59
17.799924 10.239580           16           16.0   2.0000   9.0071      285
24.017554 30.235185           32           32.0   7.0000   3.3096      125
158.437527 292.857500           64           64.0  10.0000  44.5076      175
336.991100 515.544673          128     

In [21]:
calc_r2('drugs.preductions.txt', 'test.vw')

0.6526252786437012

We trained a model with an R² of **0.65** in **30** seconds.
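
The quality figure is the coefficient of determination R² returned by `calc_r2`. As a reminder of what it measures, here is a sketch that computes it by hand with NumPy; the function name `r2_by_hand` is made up for this example.

```python
import numpy as np


def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot: the share of target variance explained by the model."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
print(r2_by_hand(y_true, y_true))                 # perfect predictions
print(r2_by_hand(y_true, [2.5, 2.5, 2.5, 2.5]))   # constant mean prediction
```

A perfect model scores 1, a model that always predicts the mean scores 0, and anything worse than the mean goes negative.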

Let's see what happens if we train the model on only part of the data.

In [22]:
%%time

! vw --final_regressor drugs.model.bin train.part1.vw \
    --onethread \
    --learning_rate 20.0 \
    --bit_precision 23 \
    --passes 40 \
    --ngram r2 \
    --interactions dc \
    --cache -k

Generating 2-grams for r namespaces.
creating features for following interactions: dc 
final_regressor = drugs.model.bin
Num weight bits = 23
learning rate = 20
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
creating cache_file = train.part1.vw.cache
Reading datafile = train.part1.vw
num sources = 1
Enabled reductions: gd, scorer
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
16.000000 16.000000            1            1.0   4.0000   0.0000      155
8.284978 0.569956            2            2.0   1.0000   1.7550      207
5.384596 2.484213            4            4.0   0.0000   1.6859      249
3.444507 1.504419            8            8.0   1.0000   0.1637       45
110.780388 218.116268           16           16.0   3.0000   7.7153      139
75.235660 39.690933           32           32.0   2.0000   4.0961      151
51.536644 27.837627           64           64.0   2.0000   1.9416      

In [23]:
! vw --testonly --initial_regressor drugs.model.bin --predictions drugs.preductions.txt test.vw

Generating 2-grams for r namespaces.
creating features for following interactions: dc 
only testing
predictions = drugs.preductions.txt
Num weight bits = 23
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = test.vw
num sources = 1
Enabled reductions: gd, scorer
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
1054.792969 1054.792969            1            1.0   2.0000  34.4776      317
529.396484 4.000000            2            2.0   2.0000   0.0000      237
351.250557 173.104630            4            4.0  10.0000  28.3633      163
214.487823 77.725090            8            8.0   1.0000  16.5387       59
140.830627 67.173430           16           16.0   2.0000  21.5064      285
101.366076 61.901526           32           32.0   7.0000   3.7109      125
202.674485 303.982894           64           64.0  10.0000  32.8632      175
677.614774 1152.555063  

In [24]:

calc_r2('drugs.preductions.txt', 'test.vw')

0.38178390321607647

Training was much faster, but we lost quality.

A model with an R² of **0.38** in **6** seconds.

**Moral:** subsampling is not the best approach. To get good quality, you need to feed all of the data into the model.

Let's start `spanning_tree` in the background and check that it is really running.

The workers will then connect to it over TCP.

In [25]:
%%bash --bg --out OUT --err ERR
spanning_tree --nondaemon

In [26]:
! ps aux | grep spanning_tree

ubuntu      6123  0.0  0.0   6068  1636 ?        S    06:54   0:00 spanning_tree --nondaemon
ubuntu      6130  0.0  0.0   9492  3248 pts/0    Ss+  06:55   0:00 /bin/bash -c  ps aux | grep spanning_tree
ubuntu      6132  0.0  0.0   9032   656 pts/0    S+   06:55   0:00 grep spanning_tree


It's time to launch the workers. We use the already familiar `vw` command and simply add a few special parameters:

* `--span_server` - the address of the manager (`spanning_tree`). In our case it is localhost; in real life it could be the IP address of another machine.
* `--unique_id` - one `spanning_tree` can serve many different training jobs at once, so they need to be told apart. `unique_id` is a number that must be the same for all of your workers so that they are not confused with others. For example, if a colleague is also training VW for a different task, they can connect their VW processes to the same `spanning_tree` with `unique_id = 0`. In that case you would launch your workers with, say, `unique_id = 5`, so that they do not mix with your colleague's workers.
* `--total` - the number of workers you plan to connect in the current training session.
* `--node` - the identifier of the current worker. Numbering starts from zero, so if you want to launch 3 workers, give them `--node` values 0, 1 and 2.
* `-d` - the data to be processed by the current worker.

All other training parameters must be the same for all workers.

To save the coefficients of the resulting model, specify the output file via `-f` or `--final_regressor` for any one of the workers, just as we did in the previous lab.
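
The rules above can be collected into a small helper that builds the command line for every worker. This is only a sketch: the function name `build_worker_commands` is hypothetical, it uses only the flags described above, and it attaches `-f` to the last worker by arbitrary choice.

```python
import shlex


def build_worker_commands(data_files, span_server, unique_id, common_args, model_path):
    """Build one `vw` command per data file; all workers share the same training args."""
    total = len(data_files)
    commands = []
    for node, data_file in enumerate(data_files):
        cmd = ['vw', '-d', data_file,
               '--span_server', span_server,
               '--total', str(total),        # number of workers in this session
               '--node', str(node),          # worker ids start from zero
               '--unique_id', str(unique_id)] + list(common_args)
        if node == total - 1:
            # Only one worker needs to write the final model.
            cmd += ['-f', model_path]
        commands.append(cmd)
    return commands


commands = build_worker_commands(
    data_files=['train.part1.vw', 'train.part2.vw'],
    span_server='localhost',
    unique_id=1,
    common_args=['--learning_rate', '20.0', '--bit_precision', '23', '--passes', '40',
                 '--ngram', 'r2', '--interactions', 'dc', '--cache', '-k'],
    model_path='drugs.model.bin',
)
for cmd in commands:
    print(shlex.join(cmd))
```

Generating the commands from one shared argument list is a simple way to make sure the training parameters really are identical across workers.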

Let's launch two workers. We start the first one in the background as well, while the second one runs right in the notebook so that we can watch the training process.

In [27]:
%%bash --bg --out OUT --err ERR

vw -d train.part1.vw \
    --span_server localhost \
    --total 2 \
    --node 0 \
    --unique_id 1 \
    --learning_rate 20.0 \
    --bit_precision 23 \
    --passes 40 \
    --ngram r2 \
    --interactions dc \
    --cache -k

In [28]:
! ps aux | grep vw

ubuntu      6185 81.5  2.2 255612 180984 ?       Sl   06:57   0:01 vw -d train.part1.vw --span_server localhost --total 2 --node 0 --unique_id 1 --learning_rate 20.0 --bit_precision 23 --passes 40 --ngram r2 --interactions dc --cache -k
ubuntu      6187  0.0  0.0   9492  3252 pts/0    Ss+  06:57   0:00 /bin/bash -c  ps aux | grep vw
ubuntu      6189  0.0  0.0   8900   728 pts/0    S+   06:57   0:00 grep vw


In [29]:
%%time

! vw -d train.part2.vw \
    --span_server localhost \
    --total 2 \
    --node 1 \
    --unique_id 1 \
    --learning_rate 20.0 \
    --bit_precision 23 \
    --passes 40 \
    --ngram r2 \
    --interactions dc \
    --cache -k \
    -f drugs.model.bin

Generating 2-grams for r namespaces.
creating features for following interactions: dc 
final_regressor = drugs.model.bin
Num weight bits = 23
learning rate = 20
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
creating cache_file = train.part2.vw.cache
Reading datafile = train.part2.vw
num sources = 1
Enabled reductions: gd, scorer
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
4.000000 4.000000            1            1.0   2.0000   0.0000      207
9.166723 14.333446            2            2.0   4.0000   0.2140       61
42.704962 76.243202            4            4.0   0.0000   6.1127      235
53.751544 64.798126            8            8.0  18.0000   2.1770      203
53.708574 53.665604           16           16.0  24.0000   6.9774      155
40.367511 27.026447           32           32.0   3.0000   6.5818      273
45.832557 51.297603           64           64.0   2.0000   8.2354     

In [30]:
! vw --testonly --initial_regressor drugs.model.bin --predictions drugs.preductions.txt test.vw

Generating 2-grams for r namespaces.
creating features for following interactions: dc 
only testing
predictions = drugs.preductions.txt
Num weight bits = 23
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = test.vw
num sources = 1
Enabled reductions: gd, scorer
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
2.849555 2.849555            1            1.0   2.0000   3.6881      317
3.424778 4.000000            2            2.0   2.0000   0.0000      237
6.356657 9.288537            4            4.0  10.0000  13.0947      163
34.073244 61.789832            8            8.0   1.0000  16.2863       59
32.563332 31.053420           16           16.0   2.0000  16.5415      285
31.656058 30.748784           32           32.0   7.0000   3.2724      125
82.980499 134.304941           64           64.0  10.0000  43.8365      175
333.861668 584.742837          128      

In [31]:
calc_r2('drugs.preductions.txt', 'test.vw')

0.6496397151076883

Quality came out almost the same as in the single-process run: an R² of 0.650 versus 0.653.

We did not see a significant speedup because we are running all of this on one machine. However, if these workers run on different machines and on large volumes of data, the training process can speed up considerably.

And the main achievement of this algorithm: we can now place the data across several machines, which in theory lets us process a dataset of arbitrary size.

### VW on Hadoop

VW is fairly easy to run as an ordinary MapReduce job. There is even a ready-made script for this, written by the authors of the tool.

You can read about how to run the tool on Hadoop here: https://github.com/VowpalWabbit/vowpal_wabbit/tree/master/cluster .

We will instead take a closer look at a more convenient interface for distributed VW training on a cluster.

### MMLSpark

Microsoft provides a whole set of libraries for Spark that makes it convenient and fast to run distributed algorithms on a Spark cluster. You can read about all of its features on the official GitHub: https://github.com/Azure/mmlspark . The code below loads it under its newer package name, `synapseml`.

We will use two tools from it: VW and LightGBM (gradient boosting).

To install mmlspark in an environment with Livy (this environment is present on the Azure cluster), it is enough to reconfigure the Spark session.

In [None]:
! cd /home/ubuntu/.ivy2/jars && \
    cp io.netty_netty-transport-native-epoll-4.1.68.Final-linux-x86_64.jar io.netty_netty-transport-native-epoll-4.1.68.Final.jar && \
    cp io.netty_netty-transport-native-kqueue-4.1.68.Final-osx-x86_64.jar io.netty_netty-transport-native-kqueue-4.1.68.Final.jar && \
    cp io.netty_netty-resolver-dns-native-macos-4.1.68.Final-osx-x86_64.jar io.netty_netty-resolver-dns-native-macos-4.1.68.Final.jar

In [1]:
import findspark
findspark.init()

In [2]:
import pyspark
se = pyspark.sql.SparkSession.builder.appName("MyApp2") \
            .config("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:0.9.5") \
            .config("spark.dynamicAllocation.enabled", False) \
            .config("spark.locality.wait", 0) \
            .getOrCreate()


SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Ivy Default Cache set to: /home/ubuntu/.ivy2/cache
The jars for the packages stored in: /home/ubuntu/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.microsoft.azure#synapseml_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-0193688d-5cd9-4d14-9cad-50511bd430f9;1.0
	confs: [default]
	found com.microsoft.azure#synapseml_2.12;0.9.5 in central
	found com.microsoft.azure#synapseml-core_2.12;0.9.5 in central
	found org.scalactic#

In [3]:
from pyspark.sql.functions import when, col
from pyspark.ml import Pipeline
from synapse.ml.vw import VowpalWabbitFeaturizer, VowpalWabbitRegressor

In [4]:
data = se.read.option("delimiter", "\t").csv('/user/drugs/data/*', header=True, inferSchema=True)

                                                                                

In [5]:
data.limit(10).toPandas()

Unnamed: 0,_c0,drugName,condition,review,rating,date,usefulCount
0,206461,Valsartan,Left Ventricular Dysfunction,"""""""It has no side effect, I take it in combina...",9.0,"May 20, 2012",27.0
1,95260,Guanfacine,ADHD,"""""""My son is halfway through his fourth week o...",,,
2,We have tried many different medications and s...,8.0,"April 27, 2010",192,,,
3,92703,Lybrel,Birth Control,"""""""I used to take another oral contraceptive, ...",,,
4,The positive side is that I didn&#039;t have a...,5.0,"December 14, 2009",17,,,
5,138000,Ortho Evra,Birth Control,"""""""This is my first time using any form of bir...",8.0,"November 3, 2015",10.0
6,35696,Buprenorphine / naloxone,Opiate Dependence,"""""""Suboxone has completely turned my life arou...",9.0,"November 27, 2016",37.0
7,155963,Cialis,Benign Prostatic Hyperplasia,"""""""2nd day on 5mg started to work with rock ha...",2.0,"November 28, 2015",43.0
8,165907,Levonorgestrel,Emergency Contraception,"""""""He pulled out, but he cummed a bit in me. I...",1.0,"March 7, 2017",5.0
9,102654,Aripiprazole,Bipolar Disorde,"""""""Abilify changed my life. There is hope. I w...",10.0,"March 14, 2015",32.0


In [6]:
data.columns

['_c0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount']

In [7]:
columns = [
    '_c0',
    'd',
    'c',
    'r',
    'rating',
    'data',
    'target',
]
df = data.toDF(*columns)
df.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- d: string (nullable = true)
 |-- c: string (nullable = true)
 |-- r: string (nullable = true)
 |-- rating: string (nullable = true)
 |-- data: string (nullable = true)
 |-- target: integer (nullable = true)



In [8]:
train, test = (
    df
    .na.drop('any')
    .randomSplit([0.9, 0.1], 422)
)

In [9]:
train.show()

[Stage 3:>                                                          (0 + 1) / 1]

+------+--------------------+-------------+--------------------+------+-----------------+------+
|   _c0|                   d|            c|                   r|rating|             data|target|
+------+--------------------+-------------+--------------------+------+-----------------+------+
| 10000|      Lo Loestrin Fe|Birth Control|"""I was on this ...|   7.0|   April 10, 2013|     4|
|100012|Desogestrel / eth...|Birth Control|"""I&#039;ve been...|   9.0|    June 17, 2017|     2|
|100013|Desogestrel / eth...|Birth Control|"""Gives me heart...|   2.0|     June 8, 2017|     1|
|100029|Desogestrel / eth...|Birth Control|"""I was switched...|   5.0| December 6, 2017|     0|
| 10004|      Lo Loestrin Fe|Birth Control|"""I&#039;m 41, u...|   2.0|   March 31, 2013|    12|
|100055|Desogestrel / eth...|Birth Control|"""I was on Apri ...|   5.0|      May 2, 2017|     1|
| 10007|      Lo Loestrin Fe|Birth Control|"""I posted on th...|   9.0|   March 10, 2013|    18|
|100071|Desogestrel / eth...|B

                                                                                

Let's create an object that builds features in VW format. It takes a dataframe and returns a dataframe with a new column containing these features.
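
The feature column is a sparse vector of size 2^numBits in which every token is mapped to an index by hashing. The sketch below shows the idea of this hashing trick in plain Python. It is an illustration only: `zlib.crc32` is a stand-in and not the hash the library actually uses, and the function name `hash_features` is made up.

```python
import zlib
from collections import Counter


def hash_features(tokens, num_bits=24):
    """Map tokens to indices in [0, 2**num_bits) and count occurrences per index."""
    size = 1 << num_bits
    counts = Counter()
    for token in tokens:
        # Stand-in hash; the real featurizer uses its own hash function.
        index = zlib.crc32(token.encode('utf-8')) % size
        counts[index] += 1.0
    return size, dict(sorted(counts.items()))


size, sparse = hash_features(['birth', 'control', 'birth'])
print(size)
print(sparse)
```

No vocabulary has to be built or shared between workers: every machine computes the same index for the same token on its own, at the cost of occasional collisions.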

In [10]:
vw_featurizer = VowpalWabbitFeaturizer(
    inputCols=["rating"], 
    stringSplitInputCols=["d", "c", "r"],
    outputCol="features",
    numBits=24
)

In [11]:
x = vw_featurizer.transform(train).rdd.first()
x['features']

                                                                                

SparseVector(16777216, {139281: 1.0, 1016583: 1.0, 1102820: 1.0, 1162472: 2.0, 1505820: 1.0, 2204132: 1.0, 2339869: 1.0, 2673967: 1.0, 2679020: 1.0, 2839966: 1.0, 2991125: 1.0, 3257429: 1.0, 3346639: 6.0, 3410361: 1.0, 3446374: 1.0, 3783103: 2.0, 3803230: 1.0, 4415074: 1.0, 4597225: 1.0, 4709980: 1.0, 4778637: 1.0, 4890812: 1.0, 5367110: 1.0, 5426661: 2.0, 5481570: 1.0, 5728618: 1.0, 5837165: 1.0, 5881332: 1.0, 6192782: 1.0, 6362983: 1.0, 6366072: 1.0, 6737743: 1.0, 7337106: 2.0, 7608613: 1.0, 7636861: 3.0, 8148668: 1.0, 8336163: 1.0, 8515415: 1.0, 9170487: 1.0, 9519332: 1.0, 9651660: 1.0, 9787552: 1.0, 9845063: 1.0, 9894590: 1.0, 9970646: 1.0, 10090473: 1.0, 10189708: 1.0, 11318998: 1.0, 11946903: 1.0, 12243560: 1.0, 12463287: 2.0, 12730453: 1.0, 12741825: 1.0, 13350349: 1.0, 13357553: 2.0, 13735132: 3.0, 13901790: 1.0, 14380379: 1.0, 14524688: 1.0, 14608623: 1.0, 14946398: 1.0, 15174436: 1.0, 15384876: 1.0, 15847749: 1.0, 16681717: 1.0, 16772414: 1.0})

Let's create an object for training the regressor. It works the same way: it is fitted on one dataframe and can then transform another dataframe by adding predictions.

In [12]:
args = "--learning_rate 20.0 --bit_precision 24 --ngram r2 --interactions dc"
vw_model = VowpalWabbitRegressor(
    featuresCol="features",
    labelCol="target",
    args=args,
    numPasses=40
)

Let's combine them into a single pipeline.

In [13]:
vw_pipeline = Pipeline(stages=[vw_featurizer, vw_model])

In [14]:
vw_trained = vw_pipeline.fit(train)

[Stage 5:>                                                          (0 + 4) / 4]

nonce 1508075414 still waiting for 3 nodes out of 4 for example node 0
nonce 1508075414 still waiting for 2 nodes out of 4 for example node 2


inbound connection from 10.128.0.27(rc1a-dataproc-d-er3ey0m8pf4us89i.mdb.yandexcloud.net:15745) serv=58694
10.128.0.27(rc1a-dataproc-d-er3ey0m8pf4us89i.mdb.yandexcloud.net:15745): nonce=1508075414
10.128.0.27(rc1a-dataproc-d-er3ey0m8pf4us89i.mdb.yandexcloud.net:15745): total=4
10.128.0.27(rc1a-dataproc-d-er3ey0m8pf4us89i.mdb.yandexcloud.net:15745): node id=1
inbound connection from 10.128.0.27(rc1a-dataproc-d-er3ey0m8pf4us89i.mdb.yandexcloud.net:15745) serv=58696
10.128.0.27(rc1a-dataproc-d-er3ey0m8pf4us89i.mdb.yandexcloud.net:15745): nonce=1508075414
10.128.0.27(rc1a-dataproc-d-er3ey0m8pf4us89i.mdb.yandexcloud.net:15745): total=4
10.128.0.27(rc1a-dataproc-d-er3ey0m8pf4us89i.mdb.yandexcloud.net:15745): node id=0
inbound connection from 10.128.0.37(rc1a-dataproc-d-rnb3p96b656hsorq.mdb.yandexcloud.net:15745) serv=34822
10.128.0.37(rc1a-dataproc-d-rnb3p96b656hsorq.mdb.yandexcloud.net:15745): nonce=1508075414
10.128.0.37(rc1a-dataproc-d-rnb3p96b656hsorq.mdb.yandexcloud.net:15745): total=4


nonce 1508075414 still waiting for 1 nodes out of 4 for example node 2


inbound connection from 10.128.0.37(rc1a-dataproc-d-rnb3p96b656hsorq.mdb.yandexcloud.net:15745) serv=34824
10.128.0.37(rc1a-dataproc-d-rnb3p96b656hsorq.mdb.yandexcloud.net:15745): nonce=1508075414
10.128.0.37(rc1a-dataproc-d-rnb3p96b656hsorq.mdb.yandexcloud.net:15745): total=4
10.128.0.37(rc1a-dataproc-d-rnb3p96b656hsorq.mdb.yandexcloud.net:15745): node id=2
                                                                                

In [15]:
prediction = vw_trained.transform(test)

Generating 2-grams for r namespaces.
creating features for following interactions: dc 
only testing
Num weight bits = 24
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = 
num sources = 1


In [16]:
prediction.show()

[Stage 6:>                                                          (0 + 1) / 1]

+------+--------------------+-------------+--------------------+------+------------------+------+--------------------+------------------+------------------+
|   _c0|                   d|            c|                   r|rating|              data|target|            features|     rawPrediction|        prediction|
+------+--------------------+-------------+--------------------+------+------------------+------+--------------------+------------------+------------------+
|100009|Desogestrel / eth...|Birth Control|"""I was on recli...|   1.0|     June 28, 2017|     2|(16777216,[139281...|               0.0|               0.0|
| 10002|      Lo Loestrin Fe|Birth Control|"""Well, I&#039;v...|   6.0|     April 4, 2013|    10|(16777216,[101658...|20.970989227294922|20.970989227294922|
|100091|Desogestrel / eth...|Birth Control|"""I started Apri...|   8.0|   January 8, 2017|     3|(16777216,[237306...|2.1582489013671875|2.1582489013671875|
|100094|Desogestrel / eth...|Birth Control|"""I started Ap

                                                                                

In [17]:
from synapse.ml.train import ComputeModelStatistics
metrics = ComputeModelStatistics(
    evaluationMetric='regression',
    labelCol='target',
    scoresCol='prediction'
).transform(prediction)

                                                                                

In [18]:
metrics.show()

+------------------+-----------------------+------------------+-------------------+
|mean_squared_error|root_mean_squared_error|               R^2|mean_absolute_error|
+------------------+-----------------------+------------------+-------------------+
|  684.964669451763|     26.171829692472077|0.4751086842716862| 16.509441626130055|
+------------------+-----------------------+------------------+-------------------+
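
For reference, the four columns of this table are related in the usual way (for example, the RMSE is the square root of the MSE: 26.17² ≈ 684.96). Below is a minimal pure-Python sketch of the standard definitions. The function name `regression_metrics` is made up for this example, and it is assumed that `ComputeModelStatistics` uses these textbook formulas.

```python
import math
from typing import Dict, Sequence


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """Textbook MSE, RMSE, MAE and R^2 for a regression model."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]

    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n

    # R^2 compares the model's squared error with that of a constant
    # prediction equal to the mean of the targets.
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    r2 = 1.0 - (mse * n) / ss_tot

    return {
        "mean_squared_error": mse,
        "root_mean_squared_error": math.sqrt(mse),
        "R^2": r2,
        "mean_absolute_error": mae,
    }


toy = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
print(toy)
```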



### SparkML

Note that the Spark standard library ships with its own machine learning module.

**HOWEVER**, it has to be said that it works very poorly. The best you can do with it is run it once and realize that you will never use it again.

This really matters, because the claim that a standard ML library works that badly does not sound very convincing; surely there are cases where it works well, right? The answer: quite possibly. To find out for yourself whether such cases exist, try training something with SparkML and get a feel for the limits of its applicability :)

In [1]:
import findspark
findspark.init()

In [2]:
import pyspark
sc = pyspark.SparkContext(appName="lsml-app-1")

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2022-03-10 07:13:09,381 WARN util.Utils: spark.executor.instances less than spark.dynamicAllocation.minExecutors is invalid, ignoring its setting, please update your configs.
2022-03-10 07:13:14,941 WARN util.Utils: spark.executor.instances less than spark.dynamicAllocation.minExecutors is invalid, ignoring its setting, please update your configs.


In [3]:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

In [4]:
from pyspark.sql import SparkSession, Row

se = SparkSession(sc)

In [5]:
data = se.read.option("delimiter", "\t").csv('/user/drugs/data/*', header=True, inferSchema=True)

                                                                                

In [6]:
data = (
    data
    .na.drop('any')
    .withColumn('ratingNum', data.rating.cast('integer'))
)


train, test = data.randomSplit([0.9, 0.1], 422)
train, test = train.cache(), test.cache()

In [7]:
tokenizer = Tokenizer(inputCol="review", outputCol="words")
wordsData = tokenizer.transform(train)

hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=2**23)
featurizedData = hashingTF.transform(wordsData)
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)

rescaledData = idfModel.transform(featurizedData)
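
To make the `HashingTF` + `IDF` step less of a black box, here is a minimal pure-Python sketch of the same idea: tokens are hashed into a fixed number of buckets, and each bucket is then reweighted by its inverse document frequency. The hash function (`zlib.crc32`) is an assumption chosen for reproducibility and is not the one Spark uses. The smoothing `log((m + 1) / (df + 1))` follows the formula given in the Spark documentation for `IDF`. All function names are made up for this example.

```python
import math
import zlib
from collections import Counter
from typing import Dict, List


def hashing_tf(tokens: List[str], num_features: int) -> Dict[int, float]:
    """Map tokens to bucket indices and count occurrences per bucket."""
    counts = Counter(zlib.crc32(tok.encode("utf-8")) % num_features for tok in tokens)
    return {idx: float(c) for idx, c in counts.items()}


def fit_idf(docs: List[Dict[int, float]]) -> Dict[int, float]:
    """Inverse document frequency per bucket: log((m + 1) / (df + 1))."""
    m = len(docs)
    df = Counter(idx for doc in docs for idx in doc)
    return {idx: math.log((m + 1) / (d + 1)) for idx, d in df.items()}


def tf_idf(doc: Dict[int, float], idf: Dict[int, float]) -> Dict[int, float]:
    """Rescale term frequencies of a document that was part of the fit."""
    return {idx: tf * idf[idx] for idx, tf in doc.items()}


num_features = 2 ** 23
docs = [
    hashing_tf("i bleed little bits".split(), num_features),
    hashing_tf("i gained weight".split(), num_features),
]
idf = fit_idf(docs)
print(tf_idf(docs[0], idf))
```

A token that occurs in every document (here `i`) gets weight `log(1) = 0`, so it contributes nothing to the final feature vector.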

                                                                                

In [8]:
rescaledData.limit(1).show()

2022-03-10 07:14:20,470 WARN scheduler.DAGScheduler: Broadcasting large task binary with size 128.1 MiB

+---+-------------------+--------------------+--------------------+------+----------------+-----------+---------+--------------------+--------------------+--------------------+
|_c0|           drugName|           condition|              review|rating|            date|usefulCount|ratingNum|               words|         rawFeatures|            features|
+---+-------------------+--------------------+--------------------+------+----------------+-----------+---------+--------------------+--------------------+--------------------+
| 10|Medroxyprogesterone|Abnormal Uterine ...|"""I&#039;m 17 ye...|   7.0|October 20, 2015|          2|        7|["""i&#039;m, 17,...|(8388608,[18700,1...|(8388608,[18700,1...|
+---+-------------------+--------------------+--------------------+------+----------------+-----------+---------+--------------------+--------------------+--------------------+



                                                                                

In [9]:
stringIndexer = StringIndexer(inputCol='drugName', outputCol = "drugIndex").setHandleInvalid("skip")
encoder = OneHotEncoder(inputCol="drugIndex", outputCol="drugVec")

pipeline = Pipeline(stages=[stringIndexer, encoder])
ohe = pipeline.fit(rescaledData).transform(rescaledData)

                                                                                

In [10]:
x = ohe.limit(1).rdd.first()
x

2022-03-09 19:32:38,736 WARN scheduler.DAGScheduler: Broadcasting large task binary with size 128.4 MiB
2022-03-09 19:32:39,856 WARN scheduler.DAGScheduler: Broadcasting large task binary with size 128.3 MiB
                                                                                

Row(_c0='10', drugName='Medroxyprogesterone', condition='Abnormal Uterine Bleeding', review='"""I&#039;m 17 years old and I got shot in August 2015, personally. I don&#039;t mind it. I mean, I bleed little bits and random times, but I&#039;d rather have the blood that&#039;s supposed to come out, come out and not worry about where it&#039;s going or staying in my body. I have my other injection in November on the 2nd, and I&#039;m still wondering if I could take it again. The only downside to the injection is that I gained access weight and I&#039;m kind of moody."""', rating='7.0', date='October 20, 2015', usefulCount=2, ratingNum=7, words=['"""i&#039;m', '17', 'years', 'old', 'and', 'i', 'got', 'shot', 'in', 'august', '2015,', 'personally.', 'i', 'don&#039;t', 'mind', 'it.', 'i', 'mean,', 'i', 'bleed', 'little', 'bits', 'and', 'random', 'times,', 'but', 'i&#039;d', 'rather', 'have', 'the', 'blood', 'that&#039;s', 'supposed', 'to', 'come', 'out,', 'come', 'out', 'and', 'not', 'worry',

In [11]:
x['drugVec']

SparseVector(3492, {13: 1.0})
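
The sparse vector above is what `StringIndexer` followed by `OneHotEncoder` produces: the drug name is first mapped to an integer index, and the index is then turned into a vector with a single `1.0`. Below is a minimal pure-Python sketch of that logic. It assumes Spark's default settings as I understand them (indices ordered by descending frequency, and `dropLast=True`, so the last category becomes the all-zeros vector). The function names are made up for this example.

```python
from collections import Counter
from typing import Dict, List, Tuple


def fit_string_indexer(values: List[str]) -> Dict[str, int]:
    """Assign index 0 to the most frequent label, 1 to the next, and so on.

    Ties are broken alphabetically in this sketch.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: idx for idx, label in enumerate(ordered)}


def one_hot(index: int, num_labels: int, drop_last: bool = True) -> Tuple[int, Dict[int, float]]:
    """Return (vector size, {active index: 1.0}) in sparse form.

    With drop_last=True the vector has num_labels - 1 slots and the last
    category is encoded as a vector of zeros.
    """
    size = num_labels - 1 if drop_last else num_labels
    active = {index: 1.0} if index < size else {}
    return size, active


drugs = ["A", "B", "A", "C", "B", "A"]
indexer = fit_string_indexer(drugs)
print(indexer)
print(one_hot(indexer["A"], len(indexer)))
```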

Preparing the features

In [30]:
# Text features: tokenize the review, hash tokens into 2**23 buckets, apply IDF
tokenizer = Tokenizer(inputCol="review", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=2**23)
idf = IDF(inputCol="rawFeatures", outputCol="reviewFeatures")

# Categorical features: index the strings, then one-hot encode the indices
stringIndexerCondition = StringIndexer(inputCol='condition', outputCol = "conditionIndex").setHandleInvalid("skip")
encoderCondition = OneHotEncoder(inputCol="conditionIndex", outputCol="conditionVec")

stringIndexerDrug = StringIndexer(inputCol='drugName', outputCol = "drugIndex").setHandleInvalid("skip")
encoderDrug = OneHotEncoder(inputCol="drugIndex", outputCol="drugVec")

# Concatenate everything into a single feature vector
assembler = VectorAssembler(inputCols=["drugVec", "conditionVec", "reviewFeatures", 'ratingNum'], outputCol="features")

preproc = Pipeline(stages=[
    tokenizer,
    hashingTF,
    idf,
    stringIndexerCondition,
    encoderCondition,
    stringIndexerDrug,
    encoderDrug,
    assembler
])

In [31]:
train_proc = preproc.fit(train).transform(train).cache()

                                                                                

KeyboardInterrupt: 

In [None]:
train_proc.show()

In [12]:
tokenizer = Tokenizer(inputCol="review", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=2**23)
idf = IDF(inputCol="rawFeatures", outputCol="reviewFeatures")

preproc = Pipeline(stages=[
    tokenizer,
    hashingTF,
    idf
])

You most likely will not be able to run the feature pipeline as written. So let's cut down the amount of computation by dropping the TF-IDF text features and keeping only the categorical ones; maybe that will work.

In [9]:
stringIndexerCondition = StringIndexer(inputCol='condition', outputCol = "conditionIndex").setHandleInvalid("skip")
encoderCondition = OneHotEncoder(inputCol="conditionIndex", outputCol="conditionVec")

stringIndexerDrug = StringIndexer(inputCol='drugName', outputCol = "drugIndex").setHandleInvalid("skip")
encoderDrug = OneHotEncoder(inputCol="drugIndex", outputCol="drugVec")

assembler = VectorAssembler(inputCols=["drugVec", "conditionVec", 'ratingNum'], outputCol="features")


preproc = Pipeline(stages=[
    stringIndexerCondition,
    encoderCondition,
    stringIndexerDrug,
    encoderDrug,
    assembler
])

In [10]:
preproc = preproc.fit(data)

                                                                                

In [11]:
train_proc = preproc.transform(train).cache()
test_proc = preproc.transform(test).cache()

In [12]:
train_proc.show()


+------+--------------------+--------------------+--------------------+------+------------------+-----------+---------+--------------+----------------+---------+-------------------+--------------------+
|   _c0|            drugName|           condition|              review|rating|              date|usefulCount|ratingNum|conditionIndex|    conditionVec|drugIndex|            drugVec|            features|
+------+--------------------+--------------------+--------------------+------+------------------+-----------+---------+--------------+----------------+---------+-------------------+--------------------+
|    10| Medroxyprogesterone|Abnormal Uterine ...|"""I&#039;m 17 ye...|   7.0|  October 20, 2015|          2|        7|          14.0|(896,[14],[1.0])|     14.0|  (3572,[14],[1.0])|(4469,[14,3586,44...|
|  1000|          Everolimus|        Breast Cance|"""Although the m...|   2.0|    March 15, 2016|          4|        2|          83.0|(896,[83],[1.0])|   1379.0|(3572,[1379],[1.0])|(4469,[

                                                                                

In [13]:
train_proc.rdd.first()

                                                                                

Row(_c0='10', drugName='Medroxyprogesterone', condition='Abnormal Uterine Bleeding', review='"""I&#039;m 17 years old and I got shot in August 2015, personally. I don&#039;t mind it. I mean, I bleed little bits and random times, but I&#039;d rather have the blood that&#039;s supposed to come out, come out and not worry about where it&#039;s going or staying in my body. I have my other injection in November on the 2nd, and I&#039;m still wondering if I could take it again. The only downside to the injection is that I gained access weight and I&#039;m kind of moody."""', rating='7.0', date='October 20, 2015', usefulCount=2, ratingNum=7, conditionIndex=14.0, conditionVec=SparseVector(896, {14: 1.0}), drugIndex=14.0, drugVec=SparseVector(3572, {14: 1.0}), features=SparseVector(4469, {14: 1.0, 3586: 1.0, 4468: 7.0}))
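
The `features` vector in this row can be checked by hand: `VectorAssembler` concatenates its inputs in the order given, so `drugVec` (size 3572) occupies indices 0..3571, `conditionVec` (size 896) occupies 3572..4467, and `ratingNum` lands in the last slot, 4468. Below is a minimal sketch of that offset arithmetic. The `(size, {index: value})` representation and the function name `assemble` are assumptions made for this example.

```python
from typing import Dict, List, Tuple

Sparse = Tuple[int, Dict[int, float]]  # (size, {index: value})


def assemble(parts: List[Sparse]) -> Sparse:
    """Concatenate sparse vectors by shifting indices by a running offset."""
    offset = 0
    merged: Dict[int, float] = {}
    for size, values in parts:
        for idx, val in values.items():
            merged[offset + idx] = val
        offset += size
    return offset, merged


drug_vec = (3572, {14: 1.0})       # drugVec from the row above
condition_vec = (896, {14: 1.0})   # conditionVec from the row above
rating_num = (1, {0: 7.0})         # the scalar ratingNum as a 1-element vector

print(assemble([drug_vec, condition_vec, rating_num]))
# (4469, {14: 1.0, 3586: 1.0, 4468: 7.0})
```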

If you did manage to build the dataset, run linear regression.

In [14]:
lr = LinearRegression(featuresCol='features', labelCol='usefulCount', maxIter=10, regParam=0.3, elasticNetParam=0.8)
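
As far as the Spark documentation describes it, the two regularization parameters enter the objective as an elastic-net penalty. With $\lambda$ = `regParam` = 0.3 and $\alpha$ = `elasticNetParam` = 0.8 (the intercept is omitted for brevity), the problem being solved is approximately:

$$
\min_{w}\; \frac{1}{2n}\sum_{i=1}^{n}\left(w^{\top}x_i - y_i\right)^2 + \lambda\left(\alpha\,\lVert w\rVert_1 + \frac{1-\alpha}{2}\,\lVert w\rVert_2^2\right)
$$

Here $\alpha = 1$ gives a pure L1 (lasso) penalty and $\alpha = 0$ a pure L2 (ridge) penalty. The large L1 share is the likely reason why many coefficients of the fitted model are exactly zero.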

In [15]:
lrModel = lr.fit(train_proc)

2022-03-10 07:17:45,095 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
2022-03-10 07:17:45,097 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
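
As a reminder of the general scheme from the introduction (workers compute partial gradients, the main machine sums them and takes a step), here is a minimal single-process sketch for least-squares regression. This is an illustration of the idea only. It is not the solver that Spark's `LinearRegression` actually uses, it has no regularization, and all names in it are made up for this example.

```python
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], float]  # (features, target)


def partial_gradient(w: List[float], partition: List[Example]) -> List[float]:
    """Sum of squared-error gradients over one partition (one worker)."""
    grad = [0.0] * len(w)
    for x, y in partition:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj
    return grad


def distributed_gd(partitions: List[List[Example]], dim: int,
                   lr: float = 0.3, steps: int = 500) -> List[float]:
    """Gradient descent where each step aggregates per-partition gradients."""
    n = sum(len(p) for p in partitions)
    w = [0.0] * dim
    for _ in range(steps):
        # "Workers": each partition computes its own partial gradient.
        partials = [partial_gradient(w, p) for p in partitions]
        # "Driver": sum the partial gradients and take one step.
        total = [sum(g[j] for g in partials) for j in range(dim)]
        w = [wj - lr * gj / n for wj, gj in zip(w, total)]
    return w


# Toy data generated from y = 1 + 2 * x; the first feature is a constant bias.
partitions = [
    [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0)],
    [([1.0, 2.0], 5.0), ([1.0, 3.0], 7.0)],
]
weights = distributed_gd(partitions, dim=2)
print(weights)
```

Only a vector of size `dim` travels over the network per partition and per step, no matter how many rows the partition holds.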


In [16]:
lrModel.coefficients

SparseVector(4469, {0: -1.5396, 7: 0.6016, 8: -2.2478, 11: 18.5525, 12: -2.6516, 13: 6.3993, 15: -4.2273, 17: 2.2721, 18: 2.7929, 19: -4.6926, 20: -1.8025, 22: -1.4182, 24: 1.5974, 25: 5.6574, 27: -3.6162, 28: 14.7229, 30: -2.1567, 31: -6.25, 32: 4.1213, 33: -0.733, 34: -1.3478, 36: 16.8292, 37: -2.0554, 42: -1.3523, 43: 8.9463, 44: 16.5407, 45: 7.6053, 46: -4.8076, 47: 19.9402, 49: 1.5438, 51: 21.3003, 54: -3.1937, 57: -0.3751, 59: 3.3786, 61: -7.7252, 63: 17.8873, 64: 3.3672, 68: -7.1302, 69: 4.5331, 73: 1.5355, 74: 22.0911, 75: 10.1474, 77: 12.5425, 82: 0.837, 83: 10.9366, 85: 0.745, 88: -1.6474, 89: -1.2763, 91: 3.3217, 94: -4.1866, 97: 8.6183, 101: 5.3111, 104: 7.3622, 106: 15.4524, 107: -1.2783, 109: 12.6343, 110: 21.9048, 111: 8.4787, 113: -5.7946, 116: 1.2026, 118: 15.4499, 119: 3.2374, 120: 6.8017, 122: 7.03, 123: 10.7318, 124: 26.3091, 126: -0.7743, 132: 1.4863, 139: -0.3653, 140: 3.7497, 142: -0.4101, 143: 34.2557, 145: 14.5784, 146: 14.928, 149: 5.9699, 152: 5.5148, 153: 6.

In [17]:
from pyspark.ml.evaluation import RegressionEvaluator

predictions = lrModel.transform(test_proc)

In [18]:
predictions.show()


+------+--------------------+-------------+--------------------+------+------------------+-----------+---------+--------------+---------------+---------+-----------------+--------------------+-------------------+
|   _c0|            drugName|    condition|              review|rating|              date|usefulCount|ratingNum|conditionIndex|   conditionVec|drugIndex|          drugVec|            features|         prediction|
+------+--------------------+-------------+--------------------+------+------------------+-----------+---------+--------------+---------------+---------+-----------------+--------------------+-------------------+
|   100| Medroxyprogesterone|Birth Control|"""Depo was not f...|   5.0|   August 17, 2015|          2|        5|           0.0|(896,[0],[1.0])|     14.0|(3572,[14],[1.0])|(4469,[14,3572,44...|  6.099772128001648|
|100002|Desogestrel / eth...|Birth Control|"""I have been ta...|   4.0|     July 12, 2017|          2|        4|           0.0|(896,[0],[1.0])|     

                                                                                

In [19]:
lr_evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="usefulCount", metricName="r2")
lr_evaluator.evaluate(predictions)

                                                                                

0.27444947176597745

### CatBoost on Spark

In [1]:
import findspark
findspark.init()

In [2]:
import pyspark
se = pyspark.sql.SparkSession.builder.appName("MyApp2") \
            .config("spark.jars.packages", "ai.catboost:catboost-spark_3.0_2.12:1.0.4") \
            .config("spark.dynamicAllocation.enabled", False) \
            .config("spark.locality.wait", 0) \
            .getOrCreate()


Ivy Default Cache set to: /home/ubuntu/.ivy2/cache
The jars for the packages stored in: /home/ubuntu/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
ai.catboost#catboost-spark_3.0_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-6579ddc8-dc65-471d-89e8-d3cf9472068f;1.0
	confs: [default]
	found ai.catboost#catboost-spark_3.0_2.12;1.0.4 in central
	found com.google.guava#guava;29.0-jre in central
	found com.google.guava#failureac

In [3]:
import catboost_spark

In [4]:
data = se.read.option("delimiter", "\t").csv('/user/drugs/data/*', header=True, inferSchema=True)

data = (
    data
    .na.drop('any')
    .withColumn('ratingNum', data.rating.cast('integer'))
)


train, test = data.randomSplit([0.9, 0.1], 422)
train, test = train.cache(), test.cache()

                                                                                

In [5]:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

In [6]:
stringIndexerCondition = StringIndexer(inputCol='condition', outputCol = "conditionIndex").setHandleInvalid("skip")
encoderCondition = OneHotEncoder(inputCol="conditionIndex", outputCol="conditionVec")

stringIndexerDrug = StringIndexer(inputCol='drugName', outputCol = "drugIndex").setHandleInvalid("skip")
encoderDrug = OneHotEncoder(inputCol="drugIndex", outputCol="drugVec")

assembler = VectorAssembler(inputCols=["drugVec", "conditionVec", 'ratingNum'], outputCol="features")


preproc = Pipeline(stages=[
    stringIndexerCondition,
    encoderCondition,
    stringIndexerDrug,
    encoderDrug,
    assembler
])

In [7]:
preproc = preproc.fit(data)

                                                                                

In [8]:
train_proc = preproc.transform(train).cache()
test_proc = preproc.transform(test).cache()

In [9]:
train_proc.limit(10).toPandas()

                                                                                

Unnamed: 0,_c0,drugName,condition,review,rating,date,usefulCount,ratingNum,conditionIndex,conditionVec,drugIndex,drugVec,features
0,10000,Lo Loestrin Fe,Birth Control,"""""""I was on this birth control for 8 months. T...",7.0,"April 10, 2013",4,7,0.0,"(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",35.0,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
1,100012,Desogestrel / ethinyl estradiol,Birth Control,"""""""I&#039;ve been taking Velivet for about a y...",9.0,"June 17, 2017",2,9,0.0,"(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",60.0,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
2,100013,Desogestrel / ethinyl estradiol,Birth Control,"""""""Gives me heartburn and indigestion. Also ma...",2.0,"June 8, 2017",1,2,0.0,"(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",60.0,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
3,100029,Desogestrel / ethinyl estradiol,Birth Control,"""""""I was switched from Azurette to Viorele by ...",5.0,"December 6, 2017",0,5,0.0,"(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",60.0,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
4,10004,Lo Loestrin Fe,Birth Control,"""""""I&#039;m 41, using for cramps and excessive...",2.0,"March 31, 2013",12,2,0.0,"(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",35.0,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
5,100055,Desogestrel / ethinyl estradiol,Birth Control,"""""""I was on Apri for about a year. The first f...",5.0,"May 2, 2017",1,5,0.0,"(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",60.0,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
6,10007,Lo Loestrin Fe,Birth Control,"""""""I posted on this forum when I first started...",9.0,"March 10, 2013",18,9,0.0,"(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",35.0,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
7,100071,Desogestrel / ethinyl estradiol,Birth Control,"""""""So I&#039;ve only been on this pill for a m...",8.0,"March 7, 2017",7,8,0.0,"(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",60.0,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
8,100085,Desogestrel / ethinyl estradiol,Birth Control,"""""""I was on this for a year or two, I had been...",1.0,"January 28, 2017",4,1,0.0,"(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",60.0,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
9,100087,Desogestrel / ethinyl estradiol,Birth Control,"""""""I&#039;m 19 and have been on this birth con...",9.0,"January 26, 2017",3,9,0.0,"(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",60.0,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."


In [10]:
from pyspark.sql.types import StructField, StructType, DoubleType
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql import Row, SparkSession

In [12]:
srcDataSchema = [
    StructField("features", VectorUDT()),
    StructField("label", DoubleType())
]

In [13]:
train_proc.rdd.map(lambda x: Row(x.features, float(x.usefulCount))).take(1)

                                                                                

[<Row(SparseVector(4469, {35: 1.0, 3572: 1.0, 4468: 7.0}), 4.0)>]

In [14]:
trainData = train_proc.rdd.map(lambda x: Row(x.features, float(x.usefulCount)))

In [15]:
trainDf = se.createDataFrame(trainData, StructType(srcDataSchema))

In [16]:
trainDf.show()


+--------------------+-----+
|            features|label|
+--------------------+-----+
|(4469,[35,3572,44...|  4.0|
|(4469,[60,3572,44...|  2.0|
|(4469,[60,3572,44...|  1.0|
|(4469,[60,3572,44...|  0.0|
|(4469,[35,3572,44...| 12.0|
|(4469,[60,3572,44...|  1.0|
|(4469,[35,3572,44...| 18.0|
|(4469,[60,3572,44...|  7.0|
|(4469,[60,3572,44...|  4.0|
|(4469,[60,3572,44...|  3.0|
|(4469,[60,3572,44...|  4.0|
|(4469,[60,3572,44...|  7.0|
|(4469,[60,3572,44...|  5.0|
|(4469,[60,3572,44...|  3.0|
|(4469,[60,3572,44...|  1.0|
|(4469,[60,3572,44...|  7.0|
|(4469,[60,3572,44...|  3.0|
|(4469,[60,3572,44...|  2.0|
|(4469,[60,3572,44...|  1.0|
|(4469,[60,3572,44...|  4.0|
+--------------------+-----+
only showing top 20 rows



                                                                                

In [17]:
evalData = test_proc.rdd.map(lambda x: Row(x.features, float(x.usefulCount)))

In [18]:
evalDf = se.createDataFrame(evalData, StructType(srcDataSchema))

In [19]:
evalDf.show()


+--------------------+-----+
|            features|label|
+--------------------+-----+
|(4469,[60,3572,44...|  2.0|
|(4469,[35,3572,44...| 10.0|
|(4469,[60,3572,44...|  3.0|
|(4469,[60,3572,44...|  2.0|
|(4469,[1379,3655,...| 16.0|
|(4469,[60,3572,44...|  3.0|
|(4469,[60,3572,44...|  1.0|
|(4469,[60,3572,44...|  2.0|
|(4469,[60,3572,44...|  4.0|
|(4469,[60,3572,44...|  1.0|
|(4469,[35,3572,44...| 23.0|
|(4469,[60,3572,44...| 13.0|
|(4469,[60,3572,44...|  6.0|
|(4469,[60,3572,44...|  1.0|
|(4469,[35,3572,44...| 18.0|
|(4469,[60,3572,44...|  8.0|
|(4469,[60,3572,44...|  1.0|
|(4469,[60,3572,44...| 15.0|
|(4469,[35,3572,44...| 36.0|
|(4469,[35,3572,44...| 14.0|
+--------------------+-----+
only showing top 20 rows



                                                                                

In [20]:
trainPool = catboost_spark.Pool(trainDf)
evalPool = catboost_spark.Pool(evalDf)

In [21]:
regressor = catboost_spark.CatBoostRegressor()

In [None]:
model = regressor.fit(trainPool, [evalPool])

2022-03-10 07:42:09,407 WARN scheduler.DAGScheduler: Broadcasting large task binary with size 1130.1 KiB
2022-03-10 07:42:09,506 WARN scheduler.DAGScheduler: Broadcasting large task binary with size 1130.1 KiB
2022-03-10 07:42:57,714 WARN scheduler.DAGScheduler: Broadcasting large task binary with size 1132.9 KiB
2022-03-10 07:43:17,781 WARN scheduler.DAGScheduler: Broadcasting large task binary with size 1132.9 KiB
2022-03-10 07:43:25,300 WARN scheduler.DAGScheduler: Broadcasting large task binary with size 1130.1 KiB
[CatBoost Master] SLF4J: Class path contains multiple SLF4J bindings.
[CatBoost Master] SLF4J: Found binding in [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
[CatBoost Master] SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
[CatBoost Master] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
[CatBoost Master]

Learning rate set to 0.121771
0:	learn: 43.8142328	test: 43.4426164	best: 43.4426164 (0)	total: 1.17s	remaining: 19m 23s
1:	learn: 41.7584243	test: 41.3465237	best: 41.3465237 (1)	total: 2.36s	remaining: 19m 36s
2:	learn: 40.0713687	test: 39.6346834	best: 39.6346834 (2)	total: 3.55s	remaining: 19m 40s
3:	learn: 38.7774662	test: 38.3169703	best: 38.3169703 (3)	total: 4.69s	remaining: 19m 27s
4:	learn: 37.6330825	test: 37.1475603	best: 37.1475603 (4)	total: 5.86s	remaining: 19m 26s
5:	learn: 36.7399274	test: 36.2343583	best: 36.2343583 (5)	total: 7.03s	remaining: 19m 25s
6:	learn: 36.0410373	test: 35.5241858	best: 35.5241858 (6)	total: 8.2s	remaining: 19m 23s
7:	learn: 35.4718411	test: 34.9501786	best: 34.9501786 (7)	total: 9.43s	remaining: 19m 29s
8:	learn: 35.0216027	test: 34.4818404	best: 34.4818404 (8)	total: 10.6s	remaining: 19m 25s
9:	learn: 34.6357441	test: 34.0914359	best: 34.0914359 (9)	total: 11.7s	remaining: 19m 23s
10:	learn: 34.3342962	test: 33.7915892	best: 33.7915892 (10)	

...
44:	learn: 32.3031114	test: 31.7942600	best: 31.7942600 (44)	total: 51.8s	remaining: 18m 18s
45:	learn: 32.2843151	test: 31.7753910	best: 31.7753910 (45)	total: 52.9s	remaining: 18m 16s
46:	learn: 32.2663985	test: 31.7544299	best: 31.7544299 (46)	total: 54s	remaining: 18m 15s
47:	learn: 32.2483379	test: 31.7362071	best: 31.7362071 (47)	total: 55.1s	remaining: 18m 13s
48:	learn: 32.2248033	test: 31.7210291	best: 31.7210291 (48)	total: 56.3s	remaining: 18m 12s
49:	learn: 32.2080067	test: 31.7121591	best: 31.7121591 (49)	total: 57.5s	remaining: 18m 11s
50:	learn: 32.1913737	test: 31.6965690	best: 31.6965690 (50)	total: 58.6s	remaining: 18m 10s
51:	learn: 32.1747413	test: 31.6804050	best: 31.6804050 (51)	total: 59.8s	remaining: 18m 9s
52:	learn: 32.1580301	test: 31.6617793	best: 31.6617793 (52)	total: 1m	remaining: 18m 8s
53:	learn: 32.1425560	test: 31.6505607	best: 31.6505607 (53)	total: 1m 2s	remaining: 18m 7s
54:	learn: 32.1257225	test: 31.6367433	best: 31.6367433 (54)	total: 1m 3s	remaining
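
For intuition about what distributed tree training has to exchange, recall the scheme from the introduction: each worker builds a histogram of its rows over feature bins, and the main machine merges the histograms and picks a threshold. Below is a minimal single-feature sketch of that idea. It is an illustration only and does not claim to reproduce what CatBoost for Spark does internally; the bin borders, the squared-error criterion and all function names are assumptions made for this example.

```python
import bisect
from typing import List, Tuple

Hist = List[List[float]]  # per bin: [count, sum of targets, sum of squared targets]


def local_histogram(rows: List[Tuple[float, float]], edges: List[float]) -> Hist:
    """Worker side: accumulate target statistics per feature bin.

    Bin b holds the values x with edges[b - 1] < x <= edges[b].
    """
    hist = [[0.0, 0.0, 0.0] for _ in range(len(edges) + 1)]
    for x, y in rows:
        b = bisect.bisect_left(edges, x)
        hist[b][0] += 1.0
        hist[b][1] += y
        hist[b][2] += y * y
    return hist


def merge(hists: List[Hist]) -> Hist:
    """Main machine: histograms from different workers simply add up."""
    return [[sum(h[b][k] for h in hists) for k in range(3)] for b in range(len(hists[0]))]


def sse(count: float, total: float, total_sq: float) -> float:
    """Squared error around the mean, computed from the sufficient statistics."""
    return total_sq - total * total / count


def best_split(hist: Hist, edges: List[float]) -> Tuple[float, float]:
    """Pick the border `x <= edge` that minimizes the summed squared error."""
    whole = [sum(bin_[k] for bin_ in hist) for k in range(3)]
    left = [0.0, 0.0, 0.0]
    best = (edges[0], float("inf"))
    for b, edge in enumerate(edges):
        left = [l + v for l, v in zip(left, hist[b])]
        right = [w - l for w, l in zip(whole, left)]
        if left[0] == 0 or right[0] == 0:
            continue  # a split must leave rows on both sides
        score = sse(*left) + sse(*right)
        if score < best[1]:
            best = (edge, score)
    return best


edges = [1.5, 5.0, 8.5]
worker_rows = [
    [(1.0, 10.0), (2.0, 10.0)],   # rows stored on worker 1
    [(8.0, 50.0), (9.0, 50.0)],   # rows stored on worker 2
]
merged = merge([local_histogram(rows, edges) for rows in worker_rows])
print(best_split(merged, edges))
```

The workers send only `len(edges) + 1` triples per feature, so the traffic does not depend on the number of rows.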

In [None]:
predictions = model.transform(evalPool.data)
predictions.show()