![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# **GenericSVMClassifierApproach**

This notebook will cover the different parameters and usages of `GenericSVMClassifierApproach`.


**📖 Learning Objectives:**
1. Understand how `GenericSVMClassifierApproach` trains a TensorFlow-based SVM classifier on feature vectors.

2. Become comfortable using the different parameters of the annotator.

**🔗 Helpful Links:**

- Documentation : [GenericSVMClassifierApproach](https://colab.research.google.com/drive/1SCD1jaX-8-GVMdupJ0xEblJjew798NeF#scrollTo=qeclGJmrVLjX)

- Python Docs : [GenericSVMClassifierApproach](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/classification/generic_svm_classifier/index.html#module-contents)

- For extended examples of usage, see [Spark NLP Workshop repository](https://colab.research.google.com/drive/12JQuR7cSUeYfCu418REcqkSmkYWq8Efb#scrollTo=Ltt3uq8y9MKR).


## **📜 Background**

`GenericSVMClassifier` is a derivative of [GenericClassifier](https://colab.research.google.com/drive/1O96phialslGizxSt6a0F7KSy-RUvJtTS?usp=sharing) which implements [SVM (Support Vector Machine) classification](https://en.wikipedia.org/wiki/Support_vector_machine). The input to the model is FeatureVector and the output is category annotations with labels and corresponding confidence scores. The scores are standardized using the logistic function so that they vary between 0 and 1.
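To make the last sentence concrete, here is a minimal sketch of the logistic function, assuming the raw model output is a signed decision score. The function name `logistic` and the sample scores are illustrative and are not part of the library.

```python
import math

def logistic(score: float) -> float:
    """Map a real-valued decision score to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical raw decision scores: negative, zero, and positive
for raw in (-2.0, 0.0, 2.0):
    print(f"{raw:+.1f} -> {logistic(raw):.3f}")
```

A score of zero maps to 0.5, and scores of equal magnitude but opposite sign map to values that sum to 1.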

## **🎬 Colab Setup**

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
!pip install -q tensorflow==2.12.0 tensorflow_addons

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical, visual

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.install()

In [None]:
from johnsnowlabs import nlp, medical, visual
import pandas as pd

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8448_floating.json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.1.4, 💊Spark-Healthcare==5.1.3, running on ⚡ PySpark==3.1.2


In [None]:
spark

## **🖨️ Input/Output Annotation Types**



- Input: `FEATURE_VECTOR`

- Output: `CATEGORY`

## **🔎 Parameters**

- `batchSize`: (int) Batch size

- `dropout`: (float) Dropout coefficient

- `epochsNumber`: (int) Maximum number of epochs to train

- `featureScaling`: (str) Feature scaling method. Possible values are 'zscore', 'minmax' or empty (no scaling)

- `fixImbalance`: (boolean) Fix the imbalance in the training set by replicating examples of underrepresented categories

- `labelColumn`: (str) Column with label per each document

- `learningRate`: (float) Learning Rate

- `modelFile`: (str) Location of file of the model used for classification

- `multiClass`: (boolean) If multiClass is set, the model will return all the labels with corresponding scores. By default, multiClass is false.

- `outputLogsPath`: (str) Folder path to save training logs. If no path is specified, the logs won't be stored on disk. The path can be a local file path, a distributed file path (HDFS, DBFS), or cloud storage (S3).

- `validationSplit`: (float) The proportion of the training dataset to be used as a validation set. The model is validated against this subset on each epoch, and the subset is not used for training. The value should be between 0.0 and 1.0.
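The `fixImbalance` description above mentions replicating examples of underrepresented categories. The sketch below shows one common way to do that (random oversampling up to the size of the largest class). It is an assumption about the general technique, not a description of the annotator's internals; `replicate_minority` and the toy rows are hypothetical.

```python
import random
from collections import Counter

def replicate_minority(rows, label_key="category", seed=42):
    """Oversample so every label has as many rows as the largest class."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw extra copies (with replacement) until the class reaches the target size
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced

# Toy data: 5 "neg" rows and 2 "pos" rows
toy = [{"text": f"n{i}", "category": "neg"} for i in range(5)]
toy += [{"text": f"p{i}", "category": "pos"} for i in range(2)]

balanced = replicate_minority(toy)
print(Counter(row["category"] for row in balanced))
```

After replication both labels have the same number of rows, so each class contributes equally to the loss during training.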

# ✍  Explaining GenericSVMClassifierApproach with Example

## 📑 Sample Data

In [None]:
# Download the sample datasets
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/ADE_Corpus_V2/ADE-NEG.txt
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/ADE_Corpus_V2/DRUG-AE.rel

In [None]:
df_neg= pd.read_csv("ADE-NEG.txt", header=None, delimiter="\t", names=["col1"])
df_neg.head()

Unnamed: 0,col1
0,6460590 NEG Clioquinol intoxication occurring ...
1,"8600337 NEG ""Retinoic acid syndrome"" was preve..."
2,8402502 NEG BACKGROUND: External beam radiatio...
3,"8700794 NEG Although the enuresis ceased, she ..."
4,17662448 NEG A 42-year-old woman had uneventfu...


In [None]:
df_neg['text'] =  df_neg.col1.str.split('NEG').str[1]
df_neg["category"] = "neg"
df_neg= df_neg[["text", "category"]]
df_neg.head()

Unnamed: 0,text,category
0,Clioquinol intoxication occurring in the trea...,neg
1,"""Retinoic acid syndrome"" was prevented with s...",neg
2,BACKGROUND: External beam radiation therapy o...,neg
3,"Although the enuresis ceased, she developed t...",neg
4,A 42-year-old woman had uneventful bilateral ...,neg


In [None]:
df_pos= pd.read_csv("DRUG-AE.rel", header=None, delimiter="|")
df_pos.head()

Unnamed: 0,0,1,2,3,4,5,6,7
0,10030778,Intravenous azithromycin-induced ototoxicity.,ototoxicity,43,54,azithromycin,22,34
1,10048291,"Immobilization, while Paget's bone disease was...",increased calcium-release,960,985,dihydrotachysterol,908,926
2,10048291,Unaccountable severe hypercalcemia in a patien...,hypercalcemia,31,44,dihydrotachysterol,94,112
3,10082597,METHODS: We report two cases of pseudoporphyri...,pseudoporphyria,620,635,naproxen,646,654
4,10082597,METHODS: We report two cases of pseudoporphyri...,pseudoporphyria,620,635,oxaprozin,659,668


In [None]:
df_pos["category"]= "pos"
df_pos.rename(columns={1: "text"}, inplace=True)
df_pos= df_pos[["text", "category"]]
df_pos.head()

Unnamed: 0,text,category
0,Intravenous azithromycin-induced ototoxicity.,pos
1,"Immobilization, while Paget's bone disease was...",pos
2,Unaccountable severe hypercalcemia in a patien...,pos
3,METHODS: We report two cases of pseudoporphyri...,pos
4,METHODS: We report two cases of pseudoporphyri...,pos


In [None]:
ade_df= pd.concat([df_neg, df_pos])
ade_df.head()

Unnamed: 0,text,category
0,Clioquinol intoxication occurring in the trea...,neg
1,"""Retinoic acid syndrome"" was prevented with s...",neg
2,BACKGROUND: External beam radiation therapy o...,neg
3,"Although the enuresis ceased, she developed t...",neg
4,A 42-year-old woman had uneventful bilateral ...,neg


In [None]:
ade_df["category"].value_counts()


neg    16695
pos     6821
Name: category, dtype: int64

In [None]:
# Sample 50% of the data to keep training fast

spark_df = spark.createDataFrame(ade_df).sample(0.5, 42) # limit the data

trainingData, testData = spark_df.randomSplit([0.8, 0.2], seed = 100)

print("Training Dataset Count: " + str(trainingData.count()))
print("Test Dataset Count: " + str(testData.count()))

Training Dataset Count: 9423
Test Dataset Count: 2399


In [None]:
spark_df.printSchema()

root
 |-- text: string (nullable = true)
 |-- category: string (nullable = true)



In [None]:
spark_df.head(3)


[Row(text=' Although the enuresis ceased, she developed throbbing headaches, nausea, vomiting, paresthesia, lethargy, fatigue, and altered mental status over the next 7 days.', category='neg'),
 Row(text=' Differences in interpretation by patients and healthcare professionals may cause the observed disparities in seriousness and outcome of reported ADRs.', category='neg'),
 Row(text=' The decedent was also prescribed tramadol, gabapentin, cetirizine, modafinil, carisoprodol, and Xyrem.', category='neg')]

## Preparing input: **FEATURE_VECTOR** for SVM Classifier

***We will use 100-dimensional embeddings as input, so the text will be vectorized by the pipeline below***

In [None]:
document_assembler = nlp.DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
            .setInputCols(["document"]) \
            .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_healthcare_100d","en","clinical/models")\
        .setInputCols(["document","token"])\
        .setOutputCol("word_embeddings")

sentence_embeddings = nlp.SentenceEmbeddings() \
        .setInputCols(["document", "word_embeddings"]) \
        .setOutputCol("sentence_embeddings") \
        .setPoolingStrategy("AVERAGE")

embeddings_pipeline = nlp.Pipeline(
    stages = [
        document_assembler,
        tokenizer,
        word_embeddings,
        sentence_embeddings,

    ])

embeddings_healthcare_100d download started this may take some time.
Approximate size to download 475.8 MB
[OK!]
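The `AVERAGE` pooling strategy set on `SentenceEmbeddings` combines the token vectors of a document into a single vector. As a rough sketch of the idea, assuming the pooling is an element-wise mean over token embeddings, the function `average_pool` and the 4-dimensional toy vectors below are illustrative only (the notebook uses 100 dimensions).

```python
def average_pool(token_vectors):
    """Element-wise mean of equally sized token vectors."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]

# Three hypothetical token embeddings for one sentence
tokens = [
    [1.0, 0.0, 2.0, -1.0],
    [3.0, 0.0, 0.0, 1.0],
    [2.0, 3.0, 1.0, 0.0],
]

sentence_vector = average_pool(tokens)
print(sentence_vector)  # [2.0, 1.0, 1.0, 0.0]
```

The result has the same dimensionality as a single token vector, which is why the graph built later reports `input_dim: 100` regardless of sentence length.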


In [None]:
trainingData_with_embeddings = embeddings_pipeline.fit(trainingData).transform(trainingData)
trainingData_with_embeddings.show(2,truncate=60)

+------------------------------------------------------------+--------+------------------------------------------------------------+------------------------------------------------------------+------------------------------------------------------------+------------------------------------------------------------+
|                                                        text|category|                                                    document|                                                       token|                                             word_embeddings|                                         sentence_embeddings|
+------------------------------------------------------------+--------+------------------------------------------------------------+------------------------------------------------------------+------------------------------------------------------------+------------------------------------------------------------+
| A 15-year-old boy had temporary hypertropia, supra

In [None]:
# Keep only the text, the label, and the sentence embeddings

trainingData_with_embeddings = embeddings_pipeline.fit(trainingData).transform(trainingData)
trainingData_with_embeddings = trainingData_with_embeddings.select("text","category","sentence_embeddings")
trainingData_with_embeddings.show(2,truncate=60)

+------------------------------------------------------------+--------+------------------------------------------------------------+
|                                                        text|category|                                         sentence_embeddings|
+------------------------------------------------------------+--------+------------------------------------------------------------+
| A 15-year-old boy had temporary hypertropia, supraductio...|     neg|[{sentence_embeddings, 0, 262,  A 15-year-old boy had tem...|
| A 17-year-old anuric female patient with end-stage renal...|     neg|[{sentence_embeddings, 0, 185,  A 17-year-old anuric fema...|
+------------------------------------------------------------+--------+------------------------------------------------------------+
only showing top 2 rows



In [None]:
testData_with_embeddings = embeddings_pipeline.fit(testData).transform(testData)
testData_with_embeddings = testData_with_embeddings.select("text","category","sentence_embeddings")
testData_with_embeddings.show(2,truncate=60)

+------------------------------------------------------------+--------+------------------------------------------------------------+
|                                                        text|category|                                         sentence_embeddings|
+------------------------------------------------------------+--------+------------------------------------------------------------+
| A 29-year-old woman who underwent laser in situ keratomi...|     neg|[{sentence_embeddings, 0, 216,  A 29-year-old woman who u...|
| A 52-year-old woman was seen in our outpatient clinic fo...|     neg|[{sentence_embeddings, 0, 198,  A 52-year-old woman was s...|
+------------------------------------------------------------+--------+------------------------------------------------------------+
only showing top 2 rows



In [None]:
testData_with_embeddings.printSchema()


root
 |-- text: string (nullable = true)
 |-- category: string (nullable = true)
 |-- sentence_embeddings: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)



In [None]:
testData_with_embeddings.select(testData_with_embeddings.sentence_embeddings.embeddings).show(3,truncate=120)


+------------------------------------------------------------------------------------------------------------------------+
|                                                                                          sentence_embeddings.embeddings|
+------------------------------------------------------------------------------------------------------------------------+
|[[-0.09864674, 0.021150807, 0.1699883, 0.113804236, -0.05979607, -0.1667684, 0.09705255, 0.08950208, 0.011517063, -0....|
|[[0.05888334, 0.1264814, 0.07519952, 0.17283632, -0.20441434, -0.07624834, 0.20318455, -0.026616767, -0.076673746, 0....|
|[[0.057804022, 0.047698125, 0.06304918, 0.15662706, -0.04416214, -0.06267688, 0.08489624, -0.0543084, -0.0027059284, ...|
+------------------------------------------------------------------------------------------------------------------------+
only showing top 3 rows



## **💻Pipeline**

In [None]:
log_folder="logs"
!mkdir -p $log_folder

### **✅ Graph Creation**

In [None]:
graph_folder = "graph_folder"

gc_svm_graph_builder = medical.TFGraphBuilder()\
    .setModelName("svm_classifier")\
    .setInputCols(["feature_vector"]) \
    .setLabelColumn("category")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("svm_graph.pb")

### ⏰**Training with parameters**

```
{Param(parent='GenericSVMClassifierApproach_ad6f332b1585', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,
 Param(parent='GenericSVMClassifierApproach_ad6f332b1585', name='labelColumn', doc='Column with one label per document'): 'label',
 Param(parent='GenericSVMClassifierApproach_ad6f332b1585', name='batchSize', doc='Size for each batch in the optimization process'): 32,
 Param(parent='GenericSVMClassifierApproach_ad6f332b1585', name='epochsN', doc='Number of epochs for the optimization process'): 10,
 Param(parent='GenericSVMClassifierApproach_ad6f332b1585', name='learningRate', doc='Learning rate for the optimization process'): 0.001,
 Param(parent='GenericSVMClassifierApproach_ad6f332b1585', name='dropout', doc='Dropout at the output of each layer'): 0.05,
 Param(parent='GenericSVMClassifierApproach_ad6f332b1585', name='fixImbalance', doc='A flag indicating whenther to balance the trainig set'): False,
 Param(parent='GenericSVMClassifierApproach_ad6f332b1585', name='featureScaling', doc="Feature scaling method. Possible values are 'zscore', 'minmax' or empty (no scaling)"): '',
 Param(parent='GenericSVMClassifierApproach_ad6f332b1585', name='outputLogsPath', doc='Path to folder where logs will be saved. If no path is specified, no logs are generated'): '',
 Param(parent='GenericSVMClassifierApproach_ad6f332b1585', name='multiClass', doc='Whether to return only the label with the highest confidence score or all labels'): False}
```

We will train a new model with `GenericSVMClassifierApproach` and assess each parameter in turn.
The parameters fall into two categories.
`labelColumn`, `modelFile`, and `outputLogsPath` have no effect on model performance or quality, so we will not tune them. The remaining parameters influence training and are examined one by one below.


### ▶ `epochsNumber`:

 The number of times the algorithm passes over the entire training set while updating its weights.
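To illustrate what an epoch is, the sketch below trains a tiny linear SVM with the hinge loss in plain Python. It is an unregularized toy version written for this explanation, not the TensorFlow graph the annotator uses; `hinge_loss`, `train_linear_svm`, and the data are hypothetical.

```python
def hinge_loss(w, b, X, y):
    """Mean hinge loss max(0, 1 - y * (w.x + b)) over the data set."""
    total = 0.0
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi)) + b
        total += max(0.0, 1.0 - yi * score)
    return total / len(X)

def train_linear_svm(X, y, epochs, learning_rate=0.01):
    """Per-sample subgradient descent on the hinge loss; labels are +1 or -1."""
    w, b = [0.0] * len(X[0]), 0.0
    history = []
    for _ in range(epochs):  # one epoch = one full pass over the data
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score < 1.0:  # inside the margin: take a step
                w = [wj + learning_rate * yi * xj for wj, xj in zip(w, xi)]
                b += learning_rate * yi
        history.append(hinge_loss(w, b, X, y))  # loss after this epoch
    return w, b, history

# Linearly separable toy data
X = [(2.0, 1.0), (3.0, 2.0), (2.5, 1.5), (-2.0, -1.0), (-3.0, -2.0), (-2.5, -1.5)]
y = [1, 1, 1, -1, -1, -1]

w, b, history = train_linear_svm(X, y, epochs=20)
predictions = [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b >= 0 else -1 for xi in X]
print("loss after epoch 1:", history[0])
print("loss after epoch 20:", history[-1])
```

Each additional epoch gives the model another pass over the same data, so the loss recorded after the last epoch is lower than after the first on this toy set.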

In [None]:
features_asm = medical.FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("feature_vector")

gen_clf = medical.GenericSVMClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/svm_graph.pb")\
    .setEpochsNumber(2)\
    .setBatchSize(128)\
    .setLearningRate(0.015)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.1)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = nlp.Pipeline(stages=[
    features_asm,
    gc_svm_graph_builder,
    gen_clf])


In [None]:
model = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: svm_classifier
Graph folder: graph_folder
Graph file name: svm_graph.pb
Build params: {'input_dim': 100, 'output_dim': 2, 'hidden_layers': [], 'output_act': 'sigmoid', 'loss_func': 'hinge'}
svm_classifier graph exported to graph_folder/svm_graph.pb


In [None]:
! ls -t $log_folder | head -n 1
! cat $log_folder/$(ls -t $log_folder | head -n 1)


GenericSVMClassifierApproach_051ce6e2e854.log
Training 2 epochs
Epoch 1/2	0.13s	Loss: 44.346165	ACC: 0.7142985
Epoch 2/2	0.07s	Loss: 40.961517	ACC: 0.735992
Training completed.


As seen above, 2 epochs take about 0.20 s in total and reach an accuracy of about .736.

In [None]:
gen_clf = medical.GenericSVMClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/svm_graph.pb")\
    .setEpochsNumber(30)\
    .setBatchSize(128)\
    .setLearningRate(0.015)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.1)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = nlp.Pipeline(stages=[
    features_asm,
    gc_svm_graph_builder,
    gen_clf])

In [None]:
%%time
clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: svm_classifier
Graph folder: graph_folder
Graph file name: svm_graph.pb
Build params: {'input_dim': 100, 'output_dim': 2, 'hidden_layers': [], 'output_act': 'sigmoid', 'loss_func': 'hinge'}
svm_classifier graph exported to graph_folder/svm_graph.pb
CPU times: user 488 ms, sys: 19.2 ms, total: 507 ms
Wall time: 5.69 s


PipelineModel_30e7843d4669

In [None]:
! cat $log_folder/$(ls -t $log_folder | head -n 1)

Training 30 epochs
Epoch 1/30	0.13s	Loss: 45.24362	ACC: 0.70614386
Epoch 2/30	0.07s	Loss: 41.071648	ACC: 0.7361016
Epoch 3/30	0.07s	Loss: 40.13074	ACC: 0.7475544
Epoch 4/30	0.08s	Loss: 39.82691	ACC: 0.7580972
Epoch 5/30	0.07s	Loss: 39.91341	ACC: 0.754668
Epoch 6/30	0.07s	Loss: 39.645798	ACC: 0.75789666
Epoch 7/30	0.07s	Loss: 39.610588	ACC: 0.7598117
Epoch 8/30	0.07s	Loss: 39.639378	ACC: 0.7638383
Epoch 9/30	0.07s	Loss: 39.57259	ACC: 0.7632556
Epoch 10/30	0.07s	Loss: 39.657555	ACC: 0.7652909
Epoch 11/30	0.07s	Loss: 39.3809	ACC: 0.7644717
Epoch 12/30	0.07s	Loss: 39.323936	ACC: 0.7602233
Epoch 13/30	0.07s	Loss: 39.669544	ACC: 0.7620181
Epoch 14/30	0.07s	Loss: 39.400383	ACC: 0.76822156
Epoch 15/30	0.07s	Loss: 39.99315	ACC: 0.76054007
Epoch 16/30	0.08s	Loss: 39.142174	ACC: 0.7686038
Epoch 17/30	0.07s	Loss: 40.122463	ACC: 0.7610025
Epoch 18/30	0.07s	Loss: 39.373882	ACC: 0.7661208
Epoch 19/30	0.07s	Loss: 39.80465	ACC: 0.7623495
Epoch 20/30	0.07s	Loss: 39.46946	ACC: 0.76378345
Epoch 21/30	0.07

30 epochs take 5.69 s of wall time and reach an accuracy of about .764.

### ▶ `batchSize`

 The number of samples processed together in one training step. The right value depends on the available computational power and on the dataset itself: the smaller the batch size, the longer the training time and the less memory required.
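A quick back-of-the-envelope calculation shows why a small batch size slows training: it increases the number of weight updates per epoch. The sketch below uses the training row count printed earlier in this notebook (9423) and ignores any rows added by `fixImbalance`, so the real numbers may differ; `updates_per_epoch` is a hypothetical helper.

```python
import math

def updates_per_epoch(n_samples: int, batch_size: int) -> int:
    """Number of mini-batches needed to cover n_samples once."""
    return math.ceil(n_samples / batch_size)

n_train = 9423  # training row count printed earlier in this notebook

for batch_size in (128, 4):
    print(f"batch size {batch_size}: {updates_per_epoch(n_train, batch_size)} updates per epoch")
```

With these assumptions, a batch size of 128 needs 74 updates per epoch, while a batch size of 4 needs 2356.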

In [None]:
gen_clf = medical.GenericSVMClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/svm_graph.pb")\
    .setEpochsNumber(30)\
    .setBatchSize(4)\
    .setLearningRate(0.015)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.1)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = nlp.Pipeline(stages=[
    features_asm,
    gc_svm_graph_builder,
    gen_clf])




In [None]:
%%time
clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: svm_classifier
Graph folder: graph_folder
Graph file name: svm_graph.pb
Build params: {'input_dim': 100, 'output_dim': 2, 'hidden_layers': [], 'output_act': 'sigmoid', 'loss_func': 'hinge'}
svm_classifier graph exported to graph_folder/svm_graph.pb
CPU times: user 666 ms, sys: 41.9 ms, total: 707 ms
Wall time: 37.9 s


PipelineModel_9909ef312182

In [None]:
! cat $log_folder/$(ls -t $log_folder | head -n 1)

Training 30 epochs
Epoch 1/30	1.20s	Loss: 1369.9314	ACC: 0.744942
Epoch 2/30	1.08s	Loss: 1373.536	ACC: 0.7481607
Epoch 3/30	1.13s	Loss: 1353.2765	ACC: 0.75541174
Epoch 4/30	1.12s	Loss: 1351.0005	ACC: 0.7580645
Epoch 5/30	1.09s	Loss: 1355.1453	ACC: 0.7518746
Epoch 6/30	1.13s	Loss: 1354.006	ACC: 0.75318336
Epoch 7/30	1.11s	Loss: 1364.4591	ACC: 0.75180393
Epoch 8/30	1.10s	Loss: 1359.5323	ACC: 0.75463355
Epoch 9/30	1.12s	Loss: 1358.5867	ACC: 0.75180393
Epoch 10/30	1.14s	Loss: 1359.6768	ACC: 0.7512379
Epoch 11/30	1.16s	Loss: 1345.3046	ACC: 0.75028294
Epoch 12/30	1.15s	Loss: 1346.5883	ACC: 0.75325406
Epoch 13/30	1.10s	Loss: 1361.4282	ACC: 0.7518746
Epoch 14/30	1.17s	Loss: 1345.9159	ACC: 0.7560838
Epoch 15/30	1.10s	Loss: 1370.4236	ACC: 0.74759483
Epoch 16/30	1.10s	Loss: 1337.047	ACC: 0.75491655
Epoch 17/30	1.11s	Loss: 1356.4893	ACC: 0.75382006
Epoch 18/30	1.09s	Loss: 1360.2317	ACC: 0.75328946
Epoch 19/30	1.12s	Loss: 1351.7175	ACC: 0.7526528
Epoch 20/30	1.14s	Loss: 1338.6552	ACC: 0.7578169
Epo



*   In the last two runs, we set the batch size to 128 and 4.
*   With the same number of epochs, the wall time increased from 5.69 s to 37.9 s.




In [None]:
pred_df = model.transform(testData_with_embeddings)


In [None]:
pred_df.printSchema()
pred_df.select(pred_df.prediction).show(5, truncate=False)
pred_df.select(pred_df.category, pred_df.prediction.result).show(5, truncate=False)

root
 |-- text: string (nullable = true)
 |-- category: string (nullable = true)
 |-- sentence_embeddings: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- feature_vector: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |
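The predicted labels in `prediction.result` can be scored against the gold `category` column. The following is a minimal sketch on plain Python lists with made-up labels; in the notebook the two lists would first have to be collected from `pred_df` (for example through `toPandas()`), and the helper names are hypothetical.

```python
def accuracy(gold, predicted):
    """Fraction of positions where the predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def precision_recall(gold, predicted, label):
    """Precision and recall for a single label."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up gold and predicted labels
gold = ["neg", "neg", "pos", "pos", "neg", "pos"]
predicted = ["neg", "pos", "pos", "neg", "neg", "pos"]

print("accuracy:", accuracy(gold, predicted))
print("pos precision/recall:", precision_recall(gold, predicted, "pos"))
```

Because the dataset is imbalanced, per-label precision and recall are usually more informative than accuracy alone.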

### ▶ `learningRate`

The learning rate is a hyperparameter in deep learning and many other machine learning algorithms that determines the step size at which the model's weights are updated during training. It is one of the most critical hyperparameters to tune because it has a profound impact on the training process and the final performance of the model.
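The effect of the step size can be seen on a toy problem. The sketch below minimizes f(w) = w² with plain gradient descent for a fixed number of steps; it is an illustration of the general idea and not the optimizer used by the annotator. `gradient_descent` is a hypothetical helper.

```python
def gradient_descent(learning_rate, steps, start=1.0):
    """Minimize f(w) = w**2 (gradient 2*w) and return the final w."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2.0 * w
    return w

# Same number of steps, three different learning rates
for lr in (0.015, 0.001, 0.00001):
    print(f"learning rate {lr}: w after 30 steps = {gradient_descent(lr, steps=30):.4f}")
```

The optimum is at w = 0. With the same 30 steps, the largest of the three learning rates gets closest to it, while the smallest barely moves away from the starting point.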

In [None]:
gen_clf = medical.GenericSVMClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/svm_graph.pb")\
    .setEpochsNumber(30)\
    .setBatchSize(128)\
    .setLearningRate(0.001)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.1)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = nlp.Pipeline(stages=[
    features_asm,
    gc_svm_graph_builder,
    gen_clf])




In [None]:
%%time
model = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: svm_classifier
Graph folder: graph_folder
Graph file name: svm_graph.pb
Build params: {'input_dim': 100, 'output_dim': 2, 'hidden_layers': [], 'output_act': 'sigmoid', 'loss_func': 'hinge'}
svm_classifier graph exported to graph_folder/svm_graph.pb
CPU times: user 475 ms, sys: 40.7 ms, total: 516 ms
Wall time: 5.82 s


In [None]:
! cat $log_folder/$(ls -t $log_folder | head -n 1)

Training 30 epochs
Epoch 1/30	0.13s	Loss: 63.266453	ACC: 0.6905082
Epoch 2/30	0.08s	Loss: 45.79635	ACC: 0.71287525
Epoch 3/30	0.08s	Loss: 44.067856	ACC: 0.7126935
Epoch 4/30	0.07s	Loss: 43.221058	ACC: 0.7138254
Epoch 5/30	0.08s	Loss: 42.866653	ACC: 0.71371984
Epoch 6/30	0.08s	Loss: 42.479755	ACC: 0.7151431
Epoch 7/30	0.08s	Loss: 42.327938	ACC: 0.7147756
Epoch 8/30	0.10s	Loss: 42.053	ACC: 0.7160572
Epoch 9/30	0.10s	Loss: 41.963844	ACC: 0.7157953
Epoch 10/30	0.09s	Loss: 41.887173	ACC: 0.7163993
Epoch 11/30	0.08s	Loss: 41.61777	ACC: 0.71952647
Epoch 12/30	0.07s	Loss: 41.61414	ACC: 0.7205862
Epoch 13/30	0.08s	Loss: 41.280735	ACC: 0.72134656
Epoch 14/30	0.08s	Loss: 41.202465	ACC: 0.7227338
Epoch 15/30	0.08s	Loss: 41.19994	ACC: 0.7234728
Epoch 16/30	0.08s	Loss: 41.111885	ACC: 0.72556955
Epoch 17/30	0.08s	Loss: 40.700535	ACC: 0.7287622
Epoch 18/30	0.08s	Loss: 40.815052	ACC: 0.7278668
Epoch 19/30	0.08s	Loss: 40.689014	ACC: 0.7289079
Epoch 20/30	0.08s	Loss: 40.7295	ACC: 0.72763026
Epoch 21/30	0

30 epochs, batch size 128, and learning rate 0.001 ▶ 5.82 s wall time; the accuracy is about .729 at epoch 19, the highest value in the truncated log above

In [None]:
gen_clf = medical.GenericSVMClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/svm_graph.pb")\
    .setEpochsNumber(30)\
    .setBatchSize(128)\
    .setLearningRate(0.00001)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.1)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = nlp.Pipeline(stages=[
    features_asm,
    gc_svm_graph_builder,
    gen_clf])

In [None]:
%%time
model = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: svm_classifier
Graph folder: graph_folder
Graph file name: svm_graph.pb
Build params: {'input_dim': 100, 'output_dim': 2, 'hidden_layers': [], 'output_act': 'sigmoid', 'loss_func': 'hinge'}
svm_classifier graph exported to graph_folder/svm_graph.pb
CPU times: user 487 ms, sys: 26.9 ms, total: 514 ms
Wall time: 5.71 s


In [None]:
! ls -t $log_folder | head -n 1
! cat $log_folder/$(ls -t $log_folder | head -n 1)

GenericSVMClassifierApproach_648c063ac472.log
Training 30 epochs
Epoch 1/30	0.13s	Loss: 82.54853	ACC: 0.33559406
Epoch 2/30	0.08s	Loss: 81.92425	ACC: 0.34051597
Epoch 3/30	0.07s	Loss: 81.43048	ACC: 0.35108808
Epoch 4/30	0.07s	Loss: 80.90814	ACC: 0.35859457
Epoch 5/30	0.07s	Loss: 80.25813	ACC: 0.3625155
Epoch 6/30	0.07s	Loss: 79.60011	ACC: 0.379815
Epoch 7/30	0.07s	Loss: 79.189514	ACC: 0.3871758
Epoch 8/30	0.07s	Loss: 78.62741	ACC: 0.39786822
Epoch 9/30	0.08s	Loss: 78.028305	ACC: 0.4155245
Epoch 10/30	0.08s	Loss: 77.59669	ACC: 0.42686105
Epoch 11/30	0.08s	Loss: 77.0726	ACC: 0.4451361
Epoch 12/30	0.08s	Loss: 76.62101	ACC: 0.45760855
Epoch 13/30	0.08s	Loss: 76.11975	ACC: 0.46958923
Epoch 14/30	0.08s	Loss: 75.80697	ACC: 0.4820323
Epoch 15/30	0.08s	Loss: 75.261925	ACC: 0.4988841
Epoch 16/30	0.07s	Loss: 74.8803	ACC: 0.5162531
Epoch 17/30	0.07s	Loss: 74.444275	ACC: 0.53161216
Epoch 18/30	0.07s	Loss: 74.27799	ACC: 0.53538746
Epoch 19/30	0.07s	Loss: 73.84354	ACC: 0.55228597
Epoch 20/30	0.07s	Lo

30 epochs, batch size 128, and learning rate 0.001 ▶ 5.82 s training time and .740 accuracy

30 epochs, batch size 128, and learning rate 0.00001 ▶ 5.71 s training time and .663 accuracy

Because we use only a small amount of data for training, some differences between runs are small, but the learning rate is normally a crucial parameter.
In this demo, the accuracy drops sharply with the smaller learning rate: a smaller learning rate needs more epochs to reach the optimum.

### ▶ `featureScaling`
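The two scaling options named in the parameter list, `'zscore'` and `'minmax'`, correspond to two standard transformations. The sketch below applies the textbook definitions to a single feature column; whether the annotator uses the population or the sample standard deviation is not stated here, so the population version is an assumption, and both helper functions are hypothetical. Neither handles a constant column, where the denominator would be zero.

```python
import statistics

def zscore(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def minmax(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# One hypothetical feature column (mean 5, population standard deviation 2)
feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print("zscore:", zscore(feature))
print("minmax:", minmax(feature))
```

Z-score scaling centers each feature around zero, whereas min-max scaling squeezes it into the range from 0 to 1. The cells below compare the confidence scores obtained without scaling and with each method.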

In [None]:
rdf = pd.DataFrame()

In [None]:
gen_clf = medical.GenericSVMClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/svm_graph.pb")\
    .setEpochsNumber(30)\
    .setBatchSize(128)\
    .setLearningRate(0.001)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.1)\
    .setFixImbalance(True)\
    .setValidationSplit(0.1)

clf_Pipeline = nlp.Pipeline(stages=[
    features_asm,
    gc_svm_graph_builder,
    gen_clf])
model = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: svm_classifier
Graph folder: graph_folder
Graph file name: svm_graph.pb
Build params: {'input_dim': 100, 'output_dim': 2, 'hidden_layers': [], 'output_act': 'sigmoid', 'loss_func': 'hinge'}
svm_classifier graph exported to graph_folder/svm_graph.pb


In [None]:
result_ = model.transform(trainingData_with_embeddings)

In [None]:
result_.select(result_.text, result_.prediction.metadata).show(5, truncate=50)

+--------------------------------------------------+----------------------------+
|                                              text|         prediction.metadata|
+--------------------------------------------------+----------------------------+
| A 15-year-old boy had temporary hypertropia, s...|[{confidence -> 0.78144157}]|
| A 17-year-old anuric female patient with end-s...|[{confidence -> 0.76712304}]|
| A 28-year-old female patient complained of int...|[{confidence -> 0.79150796}]|
| A 31 year old woman presented with a chest wal...|[{confidence -> 0.84184587}]|
| A 43-year-old patient with preexisting mitral ...| [{confidence -> 0.8219223}]|
+--------------------------------------------------+----------------------------+
only showing top 5 rows



In [None]:
result_.printSchema()

root
 |-- text: string (nullable = true)
 |-- category: string (nullable = true)
 |-- sentence_embeddings: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- feature_vector: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |

In [None]:
# Collect the first prediction and its metadata for each row
rdf_ = result_.select(result_.text, result_.category, result_.prediction.result[0], result_.prediction.metadata[0]).toPandas()
rdf_.columns = ['text', 'category', 'predicted category', 'conf']
# The confidence is stored as a string in the metadata map
rdf_['confidence'] = rdf_.conf.apply(lambda conf: float(conf['confidence']))
rdf_ = rdf_[['text', 'category', 'predicted category', 'confidence']]
# Copy the slice so that adding a column does not raise SettingWithCopyWarning
rdf = rdf_[['text', 'category', 'predicted category']].copy()
rdf['conf_'] = rdf_['confidence']
rdf.head()

Unnamed: 0,text,category,predicted category,conf_
0,"A 15-year-old boy had temporary hypertropia, ...",neg,neg,0.781442
1,A 17-year-old anuric female patient with end-...,neg,neg,0.767123
2,A 28-year-old female patient complained of in...,neg,neg,0.791508
3,A 31 year old woman presented with a chest wa...,neg,neg,0.841846
4,A 43-year-old patient with preexisting mitral...,neg,neg,0.821922


In [None]:
gen_clf = medical.GenericSVMClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/svm_graph.pb")\
    .setEpochsNumber(30)\
    .setBatchSize(128)\
    .setLearningRate(0.001)\
    .setFeatureScaling('zscore')\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.1)\
    .setFixImbalance(True)\
    .setValidationSplit(0.1)

clf_Pipeline = nlp.Pipeline(stages=[
    features_asm,
    gc_svm_graph_builder,
    gen_clf])
model = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: svm_classifier
Graph folder: graph_folder
Graph file name: svm_graph.pb
Build params: {'input_dim': 100, 'output_dim': 2, 'hidden_layers': [], 'output_act': 'sigmoid', 'loss_func': 'hinge'}
svm_classifier graph exported to graph_folder/svm_graph.pb


In [None]:
result_zscore = model.transform(trainingData_with_embeddings)

In [None]:
rdf_zscore = result_zscore.select(result_zscore.text,result_zscore.category, result_zscore.prediction.result[0],result_zscore.prediction.metadata[0]).toPandas()
rdf_zscore.columns  = ['text','category', 'predicted category', 'conf']
rdf_zscore['confidence'] = rdf_zscore.conf.apply(lambda conf:float(conf['confidence']))
rdf['conf_zscore'] = rdf_zscore['confidence']

rdf.head()

Unnamed: 0,text,category,predicted category,conf_,conf_zscore
0,"A 15-year-old boy had temporary hypertropia, ...",neg,neg,0.781442,0.759435
1,A 17-year-old anuric female patient with end-...,neg,neg,0.767123,0.632024
2,A 28-year-old female patient complained of in...,neg,neg,0.791508,0.665319
3,A 31 year old woman presented with a chest wa...,neg,neg,0.841846,0.79464
4,A 43-year-old patient with preexisting mitral...,neg,neg,0.821922,0.857194


In [None]:
gen_clf = medical.GenericSVMClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/svm_graph.pb")\
    .setEpochsNumber(30)\
    .setBatchSize(128)\
    .setLearningRate(0.001)\
    .setFeatureScaling('minmax')\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.1)\
    .setFixImbalance(True)\
    .setValidationSplit(0.1)

clf_Pipeline = nlp.Pipeline(stages=[
    features_asm,
    gc_svm_graph_builder,
    gen_clf])
model = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: svm_classifier
Graph folder: graph_folder
Graph file name: svm_graph.pb
Build params: {'input_dim': 100, 'output_dim': 2, 'hidden_layers': [], 'output_act': 'sigmoid', 'loss_func': 'hinge'}
svm_classifier graph exported to graph_folder/svm_graph.pb


In [None]:
result_minmax = model.transform(trainingData_with_embeddings)


In [None]:
rdf_minmax = result_minmax.select(result_minmax.text,result_minmax.category, result_minmax.prediction.result[0],result_minmax.prediction.metadata[0]).toPandas()
rdf_minmax.columns  = ['text','category', 'predicted category', 'conf']
rdf_minmax['confidence'] = rdf_minmax.conf.apply(lambda conf:float(conf['confidence']))

rdf['conf_minmax'] = rdf_minmax['confidence']
rdf.head()

Unnamed: 0,text,category,predicted category,conf_,conf_zscore,conf_minmax
0,"A 15-year-old boy had temporary hypertropia, ...",neg,neg,0.781442,0.759435,0.838469
1,A 17-year-old anuric female patient with end-...,neg,neg,0.767123,0.632024,0.836075
2,A 28-year-old female patient complained of in...,neg,neg,0.791508,0.665319,0.835654
3,A 31 year old woman presented with a chest wa...,neg,neg,0.841846,0.79464,0.834675
4,A 43-year-old patient with preexisting mitral...,neg,neg,0.821922,0.857194,0.809086


**As shown above, the three feature-scaling options yield different confidence scores; choose the one that works best for your problem and dataset.**
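
To make the difference between the two named scaling options concrete, here is a minimal pure-Python sketch of the textbook definitions of z-score and min-max scaling applied to a single feature column. It only illustrates the general techniques; how `GenericSVMClassifierApproach` implements `setFeatureScaling` internally is not shown in this notebook, and the helper names and toy values below are hypothetical.

```python
from statistics import mean, pstdev

def zscore_scale(column):
    """Center values to mean 0 and scale to unit (population) standard deviation."""
    mu = mean(column)
    sigma = pstdev(column) or 1.0  # constant column: avoid division by zero
    return [(v - mu) / sigma for v in column]

def minmax_scale(column):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(column), max(column)
    span = (hi - lo) or 1.0  # constant column: avoid division by zero
    return [(v - lo) / span for v in column]

# Hypothetical feature column with one large value
toy_feature = [1.0, 2.0, 3.0, 10.0]
toy_zscore = zscore_scale(toy_feature)
toy_minmax = minmax_scale(toy_feature)

print("zscore:", [round(v, 3) for v in toy_zscore])
print("minmax:", [round(v, 3) for v in toy_minmax])
```

Z-score scaling keeps the large value as an outlier several standard deviations from the mean, while min-max scaling squeezes the remaining values toward 0. This is one plausible reason the resulting confidence scores differ between the two runs.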

### ▶ `ValidationSplit`

When `.setValidationSplit(0.1)` is set, a fraction of the training data (here 10%) is held out for validation and the training logs also report a validation accuracy. Comparing the validation accuracy with the training accuracy shows how the model's quality evolves during training.

In [None]:
! ls -t $log_folder | head -n 1
! cat $log_folder/$(ls -t $log_folder | head -n 1)

GenericSVMClassifierApproach_4c65c6c5682d.log
Training 30 epochs
Epoch 1/30	0.15s	Loss: 46.637627	ACC: 0.70650655	Validation ACC: 0.7295864
Epoch 2/30	0.08s	Loss: 43.55682	ACC: 0.7100047	Validation ACC: 0.7295864
Epoch 3/30	0.07s	Loss: 42.528988	ACC: 0.7118703	Validation ACC: 0.7295864
Epoch 4/30	0.07s	Loss: 42.17951	ACC: 0.70942163	Validation ACC: 0.7295864
Epoch 5/30	0.07s	Loss: 41.408833	ACC: 0.7109375	Validation ACC: 0.7295864
Epoch 6/30	0.07s	Loss: 41.22085	ACC: 0.7114039	Validation ACC: 0.7295864
Epoch 7/30	0.07s	Loss: 40.84143	ACC: 0.7108209	Validation ACC: 0.7295864
Epoch 8/30	0.07s	Loss: 40.79271	ACC: 0.7100047	Validation ACC: 0.7295864
Epoch 9/30	0.07s	Loss: 40.288013	ACC: 0.7105877	Validation ACC: 0.7295864
Epoch 10/30	0.07s	Loss: 40.098537	ACC: 0.7110541	Validation ACC: 0.7295864
Epoch 11/30	0.07s	Loss: 39.95572	ACC: 0.7110541	Validation ACC: 0.7295864
Epoch 12/30	0.07s	Loss: 39.82334	ACC: 0.71163714	Validation ACC: 0.7295864
Epoch 13/30	0.07s	Loss: 39.824547	ACC: 0.7102378
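
The comparison between training and validation accuracy can also be done programmatically. The sketch below parses log lines of the shape printed above into dictionaries. The regular expression is modelled only on this sample output, so treat it as an assumption; other library versions may format the log differently. `parse_training_log` and `sample_log` are hypothetical names.

```python
import re

# Pattern modelled on the log lines shown above (an assumption, not a documented format)
EPOCH_RE = re.compile(
    r"Epoch (\d+)/\d+\s+\S+\s+Loss: (\S+)\s+ACC: (\S+)(?:\s+Validation ACC: (\S+))?"
)

def parse_training_log(text):
    """Return one dict per epoch line; val_acc is None when the line has no validation score."""
    rows = []
    for line in text.splitlines():
        match = EPOCH_RE.match(line)
        if not match:
            continue  # skip headers such as "Training 30 epochs"
        epoch, loss, acc, val_acc = match.groups()
        rows.append({
            "epoch": int(epoch),
            "loss": float(loss),
            "acc": float(acc),
            "val_acc": float(val_acc) if val_acc is not None else None,
        })
    return rows

# Two lines copied from the log output above
sample_log = (
    "Training 30 epochs\n"
    "Epoch 1/30\t0.15s\tLoss: 46.637627\tACC: 0.70650655\tValidation ACC: 0.7295864\n"
    "Epoch 2/30\t0.08s\tLoss: 43.55682\tACC: 0.7100047\tValidation ACC: 0.7295864\n"
)

for row in parse_training_log(sample_log):
    if row["val_acc"] is not None:
        gap = row["acc"] - row["val_acc"]
        print(f"epoch {row['epoch']}: train - validation accuracy = {gap:+.4f}")
```

A training accuracy that keeps rising while the validation accuracy stalls or falls is the usual sign of overfitting.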

### ▶ `multiClass`

Whether to return all labels or only the label with the highest confidence score. With `.setMultiClass(True)`, every label is returned together with its confidence score.

In [None]:
gen_clf = medical.GenericSVMClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/svm_graph.pb")\
    .setEpochsNumber(30)\
    .setBatchSize(128)\
    .setLearningRate(0.001)\
    .setFeatureScaling('minmax')\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.1)\
    .setFixImbalance(True)\
    .setValidationSplit(0.1)\
    .setMultiClass(True)

clf_Pipeline = nlp.Pipeline(stages=[
    features_asm,
    gc_svm_graph_builder,
    gen_clf])
model = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: svm_classifier
Graph folder: graph_folder
Graph file name: svm_graph.pb
Build params: {'input_dim': 100, 'output_dim': 2, 'hidden_layers': [], 'output_act': 'sigmoid', 'loss_func': 'hinge'}
svm_classifier graph exported to graph_folder/svm_graph.pb


In [None]:
preds_df = model.transform(testData_with_embeddings)

In [None]:
preds_df.select('prediction').show(5, truncate=False)

+--------------------------------------------------------------------------------------------------------------+
|prediction                                                                                                    |
+--------------------------------------------------------------------------------------------------------------+
|[{category, 0, 0, pos, {confidence -> 0.6204573}, []}, {category, 0, 0, neg, {confidence -> 0.82777494}, []}] |
|[{category, 0, 0, pos, {confidence -> 0.66048753}, []}, {category, 0, 0, neg, {confidence -> 0.85109913}, []}]|
|[{category, 0, 0, pos, {confidence -> 0.6559448}, []}, {category, 0, 0, neg, {confidence -> 0.8332604}, []}]  |
|[{category, 0, 0, pos, {confidence -> 0.6407703}, []}, {category, 0, 0, neg, {confidence -> 0.8241285}, []}]  |
|[{category, 0, 0, pos, {confidence -> 0.66346145}, []}, {category, 0, 0, neg, {confidence -> 0.8514608}, []}] |
+-----------------------------------------------------------------------------------------------
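
When all labels are returned, the highest-confidence label is not necessarily the first element: in the rows above `pos` comes first, while `neg` has the higher score. The following sketch picks the top label from one row of annotations. The plain-dict layout used here is an assumption for illustration (the real rows are Spark annotation structs), and `top_label` is a hypothetical helper.

```python
def top_label(annotations):
    """Return (label, confidence) of the annotation with the highest confidence."""
    best = max(annotations, key=lambda a: float(a["metadata"]["confidence"]))
    return best["result"], float(best["metadata"]["confidence"])

# Values copied from the first row of the output above; the dict layout is assumed
row = [
    {"result": "pos", "metadata": {"confidence": "0.6204573"}},
    {"result": "neg", "metadata": {"confidence": "0.82777494"}},
]

label, confidence = top_label(row)
print(label, confidence)
```

Note that the two scores in a row do not sum to 1, so they should be compared with each other rather than read as a probability distribution over the labels.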

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
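# Use the predictions computed above (the original referenced `pred_df`, not the `preds_df` just assigned)
preds_df = preds_df.select('category','prediction.result').toPandas()
# x[0] is the first returned label; with multiClass=True it is not necessarily the highest-confidence one
preds_df['result'] = preds_df.result.apply(lambda x : x[0])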

print (classification_report(preds_df['category'], preds_df['result']))

              precision    recall  f1-score   support

         neg       0.73      0.97      0.83      1704
         pos       0.65      0.12      0.20       695

    accuracy                           0.73      2399
   macro avg       0.69      0.55      0.52      2399
weighted avg       0.71      0.73      0.65      2399



### ▶ `setFixImbalance`

Default: `False`.

When set to `True`, the algorithm randomly duplicates examples of the minority classes to counter class imbalance.
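
To illustrate what random duplication of minority classes means, here is a minimal sketch of random oversampling in plain Python. It is not the library's implementation. In particular, growing every class to the size of the largest one is an assumption, and `oversample_minority` and the toy rows are hypothetical.

```python
import random
from collections import Counter, defaultdict

def oversample_minority(rows, label_key="category", seed=0):
    """Randomly duplicate rows of smaller classes until each class matches the largest one."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw the missing rows with replacement; k is 0 for the largest class
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced

# Hypothetical imbalanced dataset: 7 negative rows, 3 positive rows
toy_rows = [{"category": "neg"}] * 7 + [{"category": "pos"}] * 3
balanced = oversample_minority(toy_rows)
print(Counter(row["category"] for row in balanced))
```

Because the duplicated rows carry no new information, oversampling changes how often the minority class is seen during training but cannot by itself make the classes more separable.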

In [None]:
gen_clf = medical.GenericSVMClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/svm_graph.pb")\
    .setEpochsNumber(30)\
    .setBatchSize(128)\
    .setLearningRate(0.01)\
    .setFeatureScaling('minmax')\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.1)\
    .setFixImbalance(False)\
    .setMultiClass(False)
    # .setValidationSplit(0.1)\

clf_Pipeline = nlp.Pipeline(stages=[
    features_asm,
    gc_svm_graph_builder,
    gen_clf])
model = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: svm_classifier
Graph folder: graph_folder
Graph file name: svm_graph.pb
Build params: {'input_dim': 100, 'output_dim': 2, 'hidden_layers': [], 'output_act': 'sigmoid', 'loss_func': 'hinge'}
svm_classifier graph exported to graph_folder/svm_graph.pb


In [None]:
preds_df = model.transform(testData_with_embeddings)

In [None]:
preds_df = preds_df.select('category','prediction.result').toPandas()  # was `pred_df`, which is not the frame assigned above
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

              precision    recall  f1-score   support

         neg       0.73      0.97      0.83      1704
         pos       0.65      0.12      0.20       695

    accuracy                           0.73      2399
   macro avg       0.69      0.55      0.52      2399
weighted avg       0.71      0.73      0.65      2399



In [None]:
gen_clf = medical.GenericSVMClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/svm_graph.pb")\
    .setEpochsNumber(30)\
    .setBatchSize(128)\
    .setLearningRate(0.01)\
    .setFeatureScaling('minmax')\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.1)\
    .setFixImbalance(True)\
    .setMultiClass(False)
    # .setValidationSplit(0.1)\

clf_Pipeline = nlp.Pipeline(stages=[
    features_asm,
    gc_svm_graph_builder,
    gen_clf])

model = clf_Pipeline.fit(trainingData_with_embeddings)

preds_df = model.transform(testData_with_embeddings)
preds_df = preds_df.select('category','prediction.result').toPandas()  # was `pred_df`, which is not the frame assigned above
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

TF Graph Builder configuration:
Model name: svm_classifier
Graph folder: graph_folder
Graph file name: svm_graph.pb
Build params: {'input_dim': 100, 'output_dim': 2, 'hidden_layers': [], 'output_act': 'sigmoid', 'loss_func': 'hinge'}
svm_classifier graph exported to graph_folder/svm_graph.pb
              precision    recall  f1-score   support

         neg       0.73      0.97      0.83      1704
         pos       0.65      0.12      0.20       695

    accuracy                           0.73      2399
   macro avg       0.69      0.55      0.52      2399
weighted avg       0.71      0.73      0.65      2399



### ▶ `setDropout`

Dropout is a regularization technique commonly used in neural networks to prevent overfitting and improve the model's ability to generalize. Its key hyperparameter, the dropout ratio, is the probability that a given neuron (unit) in a layer is "dropped out", that is, temporarily removed, during each training iteration.

**`default:0.05`**
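
The mechanism can be sketched in a few lines of plain Python. This is the common "inverted dropout" variant, in which the surviving units are rescaled so that the expected activation is unchanged. Whether the classifier's TensorFlow graph uses exactly this variant is an assumption, and the `dropout` function below is a hypothetical illustration, not part of the library.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate`, rescale the survivors."""
    if not training or rate == 0.0:
        return list(activations)  # dropout is a no-op at inference time or with rate 0
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]

rng = random.Random(42)
units = [1.0] * 10_000  # hypothetical layer activations
dropped = dropout(units, rate=0.1, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
print(f"fraction of units dropped: {zero_fraction:.3f}")
print(f"mean activation after dropout: {sum(dropped) / len(dropped):.3f}")
```

With `rate=0.1`, roughly one unit in ten is zeroed on each pass, while the mean activation stays close to its original value because the remaining units are divided by `1 - rate`.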

In [None]:
# Let's compare dropout values of 0.015 and 0.00

In [None]:
gen_clf = medical.GenericSVMClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/svm_graph.pb")\
    .setEpochsNumber(30)\
    .setBatchSize(128)\
    .setLearningRate(0.001)\
    .setFeatureScaling('minmax')\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.015)\
    .setFixImbalance(True)\
    .setMultiClass(False)

clf_Pipeline = nlp.Pipeline(stages=[
    features_asm,
    gc_svm_graph_builder,
    gen_clf])
model = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: svm_classifier
Graph folder: graph_folder
Graph file name: svm_graph.pb
Build params: {'input_dim': 100, 'output_dim': 2, 'hidden_layers': [], 'output_act': 'sigmoid', 'loss_func': 'hinge'}
svm_classifier graph exported to graph_folder/svm_graph.pb


In [None]:
preds_df = model.transform(testData_with_embeddings)

In [None]:
preds_df = preds_df.select('category','prediction.result').toPandas()  # was `pred_df`, which is not the frame assigned above
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

              precision    recall  f1-score   support

         neg       0.73      0.97      0.83      1704
         pos       0.65      0.12      0.20       695

    accuracy                           0.73      2399
   macro avg       0.69      0.55      0.52      2399
weighted avg       0.71      0.73      0.65      2399



In [None]:
gen_clf = medical.GenericSVMClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/svm_graph.pb")\
    .setEpochsNumber(30)\
    .setBatchSize(128)\
    .setLearningRate(0.001)\
    .setFeatureScaling('minmax')\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.00)\
    .setFixImbalance(True)\
    .setMultiClass(False)

clf_Pipeline = nlp.Pipeline(stages=[
    features_asm,
    gc_svm_graph_builder,
    gen_clf])
model = clf_Pipeline.fit(trainingData_with_embeddings)


TF Graph Builder configuration:
Model name: svm_classifier
Graph folder: graph_folder
Graph file name: svm_graph.pb
Build params: {'input_dim': 100, 'output_dim': 2, 'hidden_layers': [], 'output_act': 'sigmoid', 'loss_func': 'hinge'}
svm_classifier graph exported to graph_folder/svm_graph.pb


In [None]:
preds_df = model.transform(testData_with_embeddings)
preds_df = preds_df.select('category','prediction.result').toPandas()  # was `pred_df`, which is not the frame assigned above
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

              precision    recall  f1-score   support

         neg       0.71      1.00      0.83      1704
         pos       0.71      0.02      0.03       695

    accuracy                           0.71      2399
   macro avg       0.71      0.51      0.43      2399
weighted avg       0.71      0.71      0.60      2399



***The f1-score for `pos` dropped sharply, from 0.20 with a dropout of 0.015 to 0.03 with a dropout of 0.00.***