![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# **GenericSVMClassifierModel**

This notebook covers the parameters and usage of `GenericSVMClassifierModel`.

**📖 Learning Objectives:**

1. Background: Understand the `GenericSVMClassifierModel` annotator.

2. Colab setup.

3. Become comfortable with using the different parameters of the annotator.

**🔗 Helpful Links:**

- Python Docs : [GenericSVMClassifierModel](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/classification/generic_svm_classifier/index.html#sparknlp_jsl.annotator.classification.generic_svm_classifier.GenericSVMClassifierModel)

- Scala Docs: [GenericSVMClassifierModel](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/classification/GenericSVMClassifierModel.html)

- For extended usage examples, see the [Spark NLP Workshop repository](https://colab.research.google.com/drive/12JQuR7cSUeYfCu418REcqkSmkYWq8Efb#scrollTo=Ltt3uq8y9MKR).


## **📜 Background**

`GenericSVMClassifier` is a derivative of [GenericClassifier](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#genericclassifier) that implements [SVM (Support Vector Machine) classification](https://en.wikipedia.org/wiki/Support_vector_machine).

The model takes a `FEATURE_VECTOR` annotation as input and outputs `CATEGORY` annotations that carry the predicted labels and their confidence scores.

The scores are standardized using the logistic function so that they vary between 0 and 1.
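
To make the logistic mapping concrete, here is a minimal sketch in plain Python. It assumes the raw scores are signed SVM decision values; the example scores and the `logistic` helper are hypothetical, and any additional scaling the annotator may apply internally is not shown.

```python
import math

def logistic(score: float) -> float:
    """Map a real-valued score to the open interval (0, 1)."""
    # Numerically stable form: avoids overflow in exp() for large negative scores
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    exp_s = math.exp(score)
    return exp_s / (1.0 + exp_s)

# Hypothetical raw decision values (assumption: signed distances from the SVM hyperplane)
raw_scores = [-3.0, -0.5, 0.0, 0.5, 3.0]
confidences = [logistic(s) for s in raw_scores]

for s, c in zip(raw_scores, confidences):
    print(f"raw score {s:+.1f} -> confidence {c:.3f}")
```

A score of zero maps to 0.5, and scores further from zero map closer to 0 or 1.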

## **🎬 Colab Setup**

In [None]:
# Install the johnsnowlabs library to access Spark-NLP for Healthcare
! pip install -q johnsnowlabs==5.1.0

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the jars for the Spark session JVM
nlp.install()

In [None]:
from johnsnowlabs import nlp, medical
import pandas as pd
from pyspark.sql import SparkSession
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
import sparknlp_jsl
import sparknlp

# Automatically load the license data and start a session with all jars the user has access to
spark = nlp.start()

In [None]:
spark

## **🖨️ Input/Output Annotation Types**



- Input: `FEATURE_VECTOR`

- Output: `CATEGORY`

## **🔎 Parameters**

`featureScaling`: Feature scaling method. Possible values are `zscore`, `minmax`, or empty (no scaling, the default).

`multiClass`: Whether to return only the label with the highest confidence score or all labels (default: `False`, current: `False`).

`inputCols`: Names of the input annotation columns (current: `['features']`).

`outputCol`: Name of the output annotation column; the default can be kept (current: `class`).



# ✍  Explaining GenericSVMClassifierModel with an Example

## 📑 Sample Data

In [None]:
# Download the sample datasets
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/ADE_Corpus_V2/ADE-NEG.txt
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/ADE_Corpus_V2/DRUG-AE.rel

In [None]:
# Negative examples: keep the text that follows the "NEG" marker on each line
df_neg = pd.read_csv("ADE-NEG.txt", header=None, delimiter="\t", names=["col1"])
df_neg["text"] = df_neg.col1.str.split("NEG").str[1]
df_neg["category"] = "neg"
df_neg = df_neg[["text", "category"]]

# Positive examples: pipe-delimited file whose column 1 holds the sentence
df_pos = pd.read_csv("DRUG-AE.rel", header=None, delimiter="|")
df_pos["category"] = "pos"
df_pos.rename(columns={1: "text"}, inplace=True)
df_pos = df_pos[["text", "category"]]

# Combine both sets and encode the label as a boolean (True for "pos")
ade_df = pd.concat([df_neg, df_pos])
ade_df["category"] = ade_df["category"] == "pos"
ade_df.sample(5, random_state=42)

| | text | category |
|---|---|---|
| 11571 | Nineteen months after diagnosis, her mother g... | False |
| 4577 | An infant girl with choanal atresia, athelia, ... | True |
| 5638 | A causal relationship between the administrat... | False |
| 12799 | The absence of eosinophils does not exclude t... | False |
| 427 | While for ribavirin antidepressant effects are... | True |


In [None]:
from pyspark.sql.types import StringType
sample_data = ade_df.sample(n=100, random_state=1)  # take 100 rows as sample data
sample_data = sample_data.reset_index(drop=True)
df = spark.createDataFrame(sample_data['text'].tolist(), StringType()).toDF("text")
df.show(5,truncate=60)

+------------------------------------------------------------+
|                                                        text|
+------------------------------------------------------------+
| Multicentric canine lymphoma in a 12-year-old keeshond: ...|
| Pyomyositis is a rare disease, encountered mainly in tro...|
| Both patients subsequently developed markedly elevated E...|
|Bortezomib-induced paralytic ileus is a potential gastroi...|
|However, given the clinically significant result to the i...|
+------------------------------------------------------------+
only showing top 5 rows



In [None]:
sample_data.head()

| | text | category |
|---|---|---|
| 0 | Multicentric canine lymphoma in a 12-year-old... | False |
| 1 | Pyomyositis is a rare disease, encountered ma... | False |
| 2 | Both patients subsequently developed markedly... | False |
| 3 | Bortezomib-induced paralytic ileus is a potent... | True |
| 4 | However, given the clinically significant resu... | True |


## **💻 Pipeline**

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("word_embeddings")

sentence_embeddings = nlp.SentenceEmbeddings() \
    .setInputCols(["document", "word_embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

features_asm = medical.FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("features")

generic_classifier = medical.GenericSVMClassifierModel.pretrained("generic_svm_classifier_ade", "en", "clinical/models")\
    .setInputCols(["features"])\
    .setOutputCol("class")

clf_Pipeline = nlp.Pipeline(stages=[
    document_assembler,
    tokenizer,
    word_embeddings,
    sentence_embeddings,
    features_asm,
    generic_classifier])

data = spark.createDataFrame([["""None of the patients required treatment for the overdose."""],
 ["""I feel a bit drowsy & have a little blurred vision after taking an insulin"""]]).toDF("text")
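
The `AVERAGE` pooling strategy used by the sentence embeddings stage can be pictured as an element-wise mean over the token vectors of a document. The sketch below is plain Python with toy 3-dimensional vectors; the `average_pool` helper and the vector values are hypothetical, and real embeddings have many more dimensions.

```python
def average_pool(token_vectors):
    """Element-wise mean of equally sized token vectors."""
    n = len(token_vectors)
    # zip(*...) groups the values of each dimension across all tokens
    return [sum(dim) / n for dim in zip(*token_vectors)]

# Toy word vectors (assumption: illustrative values, not real embeddings)
tokens = {
    "blurred": [0.2, 0.4, 0.6],
    "vision": [0.4, 0.0, 0.2],
}

sentence_vector = average_pool(list(tokens.values()))
print(sentence_vector)
```

The resulting single vector per document is what the features assembler then passes on to the classifier.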



In [None]:
generic_classifier

GenericSVMClassifierModel_5b8f3a210253

In [None]:
# fit() only assembles the pipeline here (the classifier is pretrained); transform() classifies the 100 sampled sentences
result = clf_Pipeline.fit(data).transform(df)

In [None]:
result.select("text", "class.result").show(truncate=100)

+----------------------------------------------------------------------------------------------------+-------+
|                                                                                                text| result|
+----------------------------------------------------------------------------------------------------+-------+
|                       Multicentric canine lymphoma in a 12-year-old keeshond: chemotherapy options.|[False]|
|                             Pyomyositis is a rare disease, encountered mainly in tropical climates.|[False]|
| Both patients subsequently developed markedly elevated EBV-DNA titers in association with monocl...|[False]|
|Bortezomib-induced paralytic ileus is a potential gastrointestinal side effect of this first-in-c...|[False]|
|However, given the clinically significant result to the interaction between tolazoline and cimeti...| [True]|
|                                              How much do novel antipsychotics benefit the patients?|[False]|
|

In [None]:
preds_df = result.select("text", "class.result").toPandas()
# Each result is a list whose first element is the label as a string; convert it to a boolean
preds_df["result"] = preds_df["result"].apply(lambda labels: labels[0] == "True")
preds_df


| | text | result |
|---|---|---|
| 0 | Multicentric canine lymphoma in a 12-year-old... | False |
| 1 | Pyomyositis is a rare disease, encountered ma... | False |
| 2 | Both patients subsequently developed markedly... | False |
| 3 | Bortezomib-induced paralytic ileus is a potent... | False |
| 4 | However, given the clinically significant resu... | True |
| ... | ... | ... |
| 95 | Subsequently he responded to the West Berlin ... | False |
| 96 | All three patients likely experienced electro... | False |
| 97 | We report a case of codeine intoxication in th... | False |
| 98 | METHODS: The clinical suspicion of chickenpox... | False |


In [None]:
from sklearn.metrics import classification_report
print(classification_report(sample_data['category'], preds_df['result']))

              precision    recall  f1-score   support

       False       0.85      0.93      0.89        72
        True       0.76      0.57      0.65        28

    accuracy                           0.83       100
   macro avg       0.81      0.75      0.77       100
weighted avg       0.82      0.83      0.82       100
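
As a worked check of the `True` row in the report above, the metrics can be recomputed from confusion-matrix counts. The counts below are not printed by the notebook; they are inferred from the rounded report (support of 28 and 72) and should be read as an assumption.

```python
# Counts inferred from the rounded report, treating True as the positive class
tp, fn = 16, 12   # the 28 samples labelled True
fp, tn = 5, 67    # the 72 samples labelled False

precision = tp / (tp + fp)                       # share of predicted True that is correct
recall = tp / (tp + fn)                          # share of actual True that was found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

The low recall for `True` relative to `False` shows that the classifier misses a sizeable share of the positive sentences in this sample.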



### ⏰ Running the **GenericSVMClassifierModel** with parameters

In [None]:
print(generic_classifier.explainParams())

classes: Categorization classes (current: ['False', 'True'])
featureScaling: Feature scaling method. Possible values are 'zscore', 'minmax' or empty (no scaling) (default: )
inputCols: previous annotations columns, if renamed (current: ['features'])
lazyAnnotator: Whether this AnnotatorModel acts as lazy in RecursivePipelines (default: False)
multiClass: Whether to return only the label with the highest confidence score or all labels (default: False, current: False)
outputCol: output annotation column. can be left default. (current: class)
storageRef: storage unique identifier (default: GenericSVMClassifierModel_5b8f3a210253)


### ▶ `featureScaling`

Feature scaling method applied to the input features. Possible values are `zscore`, `minmax`, or empty (no scaling, the default).
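
For reference, z-score and min-max scaling are commonly defined as in the sketch below. This is plain Python under the textbook definitions; the helper names are hypothetical, and the sketch does not show how the annotator computes or stores its own scaling statistics (for example, whether it uses the population or the sample standard deviation).

```python
from statistics import mean, pstdev

def zscore_scale(values):
    """Center on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    # Guard against constant features, where sigma is zero
    return [(v - mu) / sigma if sigma else 0.0 for v in values]

def minmax_scale(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]

# Toy values of a single feature across four samples (illustrative only)
feature = [2.0, 4.0, 6.0, 8.0]
z_scaled = zscore_scale(feature)
mm_scaled = minmax_scale(feature)

print(z_scaled)
print(mm_scaled)
```

Z-score scaling yields values with zero mean and unit variance, while min-max scaling confines them to the range 0 to 1, so the two can shift the classifier's decision scores in different ways.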

In [None]:
generic_classifier = medical.GenericSVMClassifierModel.pretrained("generic_svm_classifier_ade", "en", "clinical/models")\
    .setInputCols(["features"])\
    .setOutputCol("class")\
    .setFeatureScaling('zscore')

clf_Pipeline2 = nlp.Pipeline(stages=[
    document_assembler,
    tokenizer,
    word_embeddings,
    sentence_embeddings,
    features_asm,
    generic_classifier])

generic_svm_classifier_ade download started this may take some time.
[OK!]


In [None]:
result2 = clf_Pipeline2.fit(data).transform(df)
result2.select("text", "class.result","class.metadata").show(truncate=100)   # zscore values

+----------------------------------------------------------------------------------------------------+-------+----------------------------+
|                                                                                                text| result|                    metadata|
+----------------------------------------------------------------------------------------------------+-------+----------------------------+
|                       Multicentric canine lymphoma in a 12-year-old keeshond: chemotherapy options.|[False]|[{confidence -> 0.89175177}]|
|                             Pyomyositis is a rare disease, encountered mainly in tropical climates.|[False]|[{confidence -> 0.97560674}]|
| Both patients subsequently developed markedly elevated EBV-DNA titers in association with monocl...|[False]|[{confidence -> 0.99982864}]|
|Bortezomib-induced paralytic ileus is a potential gastrointestinal side effect of this first-in-c...| [True]| [{confidence -> 0.7110752}]|
|However, given the 

In [None]:
result.select("text", "class.result","class.metadata").show(truncate=100)  # default values

+----------------------------------------------------------------------------------------------------+-------+----------------------------+
|                                                                                                text| result|                    metadata|
+----------------------------------------------------------------------------------------------------+-------+----------------------------+
|                       Multicentric canine lymphoma in a 12-year-old keeshond: chemotherapy options.|[False]|[{confidence -> 0.89175177}]|
|                             Pyomyositis is a rare disease, encountered mainly in tropical climates.|[False]|[{confidence -> 0.97560674}]|
| Both patients subsequently developed markedly elevated EBV-DNA titers in association with monocl...|[False]|[{confidence -> 0.99982864}]|
|Bortezomib-induced paralytic ileus is a potential gastrointestinal side effect of this first-in-c...| [True]| [{confidence -> 0.7110752}]|
|However, given the 

In [None]:
generic_classifier = medical.GenericSVMClassifierModel.pretrained("generic_svm_classifier_ade", "en", "clinical/models")\
    .setInputCols(["features"])\
    .setOutputCol("class")\
    .setFeatureScaling('minmax')

clf_Pipeline3 = nlp.Pipeline(stages=[
    document_assembler,
    tokenizer,
    word_embeddings,
    sentence_embeddings,
    features_asm,
    generic_classifier])

generic_svm_classifier_ade download started this may take some time.
[OK!]


In [None]:
result3 = clf_Pipeline3.fit(data).transform(df)
result3.select("text", "class.result","class.metadata").show(truncate=100)  # minmax values

+----------------------------------------------------------------------------------------------------+-------+----------------------------+
|                                                                                                text| result|                    metadata|
+----------------------------------------------------------------------------------------------------+-------+----------------------------+
|                       Multicentric canine lymphoma in a 12-year-old keeshond: chemotherapy options.| [True]| [{confidence -> 0.6639472}]|
|                             Pyomyositis is a rare disease, encountered mainly in tropical climates.| [True]|[{confidence -> 0.58102447}]|
| Both patients subsequently developed markedly elevated EBV-DNA titers in association with monocl...|[False]| [{confidence -> 0.4665953}]|
|Bortezomib-induced paralytic ileus is a potential gastrointestinal side effect of this first-in-c...| [True]|[{confidence -> 0.71549857}]|
|However, given the 

As shown above, the `featureScaling` setting can change both the predicted labels and the confidence values (compare the `minmax` output with the other two), so choose the option that works best for your problem and dataset.

### ▶ `multiClass`

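Whether to return only the label with the highest confidence score or all labels. With `False` (the default), only the top label is returned; with `True`, every label is returned together with its confidence score.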
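
When all labels are returned, each label has to be paired with its confidence before it can be used. The sketch below does this in plain Python on a hypothetical collected row shaped like the `class.result` and `class.metadata` columns; treating the metadata values as strings is an assumption.

```python
# Hypothetical row, shaped like the `class.result` / `class.metadata` columns
labels = ["False", "True"]
metadata = [{"confidence": "0.3259712"}, {"confidence": "0.6639472"}]

# Pair each label with its confidence (assumption: metadata values arrive as strings)
scores = {label: float(meta["confidence"]) for label, meta in zip(labels, metadata)}

# Recover the single top label, as returned when multiClass is False
best_label = max(scores, key=scores.get)

print(scores)
print(best_label)
```

Keeping every score is useful when you want to apply your own decision threshold instead of always taking the top label.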

In [None]:
generic_classifier = medical.GenericSVMClassifierModel.pretrained("generic_svm_classifier_ade", "en", "clinical/models")\
    .setInputCols(["features"])\
    .setOutputCol("class")\
    .setMultiClass(True)

clf_Pipeline4 = nlp.Pipeline(stages=[
    document_assembler,
    tokenizer,
    word_embeddings,
    sentence_embeddings,
    features_asm,
    generic_classifier])

generic_svm_classifier_ade download started this may take some time.
[OK!]


In [None]:
result4 = clf_Pipeline4.fit(data).transform(df)
result4.select("text", "class.result","class.metadata").show(truncate=100)

# With multiClass=True, the result column lists every label and the metadata column holds the confidence of each one

+----------------------------------------------------------------------------------------------------+-------------+--------------------------------------------------------+
|                                                                                                text|       result|                                                metadata|
+----------------------------------------------------------------------------------------------------+-------------+--------------------------------------------------------+
|                       Multicentric canine lymphoma in a 12-year-old keeshond: chemotherapy options.|[False, True]|  [{confidence -> 0.3259712}, {confidence -> 0.6639472}]|
|                             Pyomyositis is a rare disease, encountered mainly in tropical climates.|[False, True]|[{confidence -> 0.33836725}, {confidence -> 0.58102447}]|
| Both patients subsequently developed markedly elevated EBV-DNA titers in association with monocl...|[False, True]| [{confidence 