![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP/GenericClassifierApproach.ipynb)

# **GenericClassifierApproach**

This notebook covers the parameters and usage of `GenericClassifierApproach`.

**📖 Learning Objectives:**

1. Understand how `GenericClassifierApproach` trains a TensorFlow model for generic classification of feature vectors.

2. Become comfortable using the different parameters of the annotator.


**🔗 Helpful Links:**

- Documentation : [GenericClassifierApproach](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#genericclassifier)

- Python Docs : [GenericClassifierApproach](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/generic_classifier/generic_classifier/index.html)

- Scala Docs : [GenericClassifierApproach](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/generic_classifier/GenericClassifierApproach.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/8.Generic_Classifier.ipynb).

## **🎬 Colab Setup**

In [None]:
!pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Please Upload your John Snow Labs License using the button below


Saving spark_nlp_for_healthcare_spark_ocr_8734_532.json to spark_nlp_for_healthcare_spark_ocr_8734_532.json


In [None]:
from johnsnowlabs import nlp, medical

nlp.install()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734_532.json
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-5.3.2-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-5.3.2-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-5.3.2.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-5.3.2.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734_532.json
Installing /root/.johnsnowlabs/py_installs/spark_nlp_jsl-5.3.2-py3-none-any.whl to /usr/bin/python3
Installed 1 products:
💊 Spark-Healthcare==5.3.2 installed! ✅ Heal the planet with NLP! 


In [None]:
import pyspark.sql.functions as F
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

spark = nlp.start()
spark

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734_532.json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.3.2, 💊Spark-Healthcare==5.3.2, running on ⚡ PySpark==3.4.0


## **🖨️ Input/Output Annotation Types**

- Input: `FEATURE_VECTOR`

- Output: `CATEGORY`

## **🔎 Parameters**


- `batchSize`: (int) Batch size

- `dropout`: (float) Dropout coefficient

- `epochsN`: (int) Maximum number of epochs to train

- `featureScaling`: (str) Feature scaling method. Possible values are 'zscore', 'minmax' or empty (no scaling)

- `fixImbalance`: (boolean) Fix the imbalance in the training set by replicating examples of underrepresented categories

- `labelColumn`: (str) Column holding the label for each document

- `learningRate`: (float) Learning Rate

- `modelFile`: (str) Location of file of the model used for classification

- `multiClass`: (boolean) If multiClass is set, the model will return all the labels with corresponding scores. By default, multiClass is false.

- `outputLogsPath`: (str) Folder path to save training logs. If no path is specified, the logs won't be stored on disk. The path can be a local file path, a distributed file path (HDFS, DBFS), or cloud storage (S3).

- `validationSplit`: (float) The proportion of the training dataset to hold out as a validation set. The model is validated against this subset after each epoch, and the subset is not used for training. The value should be between 0.0 and 1.0.
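To make the `featureScaling` options more concrete, here is a minimal NumPy sketch of what z-score and min-max scaling usually mean. It illustrates the general techniques only and is not the annotator's internal code: the helper names `zscore_scale` and `minmax_scale` are hypothetical, and the handling of constant columns is an assumption.

```python
import numpy as np

def zscore_scale(x):
    """Standardize each column to zero mean and unit standard deviation."""
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # assumption: leave constant columns unscaled
    return (x - mean) / std

def minmax_scale(x):
    """Rescale each column linearly to the [0, 1] range."""
    x = np.asarray(x, dtype=float)
    lo = x.min(axis=0)
    span = x.max(axis=0) - lo
    span[span == 0] = 1.0  # assumption: leave constant columns unscaled
    return (x - lo) / span

# Toy feature matrix: 3 rows, 2 columns on very different scales
features = np.array([[1.0, 100.0], [2.0, 150.0], [3.0, 0.0]])
z = zscore_scale(features)
m = minmax_scale(features)
print(z)
print(m)
```

Either method puts columns with very different ranges on a comparable scale before they reach the network.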

## Data Preparation

In [None]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/petfinder-mini.csv

In [None]:
dataframe = pd.read_csv('petfinder-mini.csv')

In [None]:
# In the original dataset, an AdoptionSpeed of 4 indicates the pet was not adopted.
# Map it to a binary target: 0 = not adopted, 1 = adopted.
dataframe['target'] = np.where(dataframe['AdoptionSpeed']==4, 0, 1)

In [None]:
dataframe = dataframe.drop(['AdoptionSpeed'], axis=1)

In [None]:
dataframe.head()

Unnamed: 0,Type,Age,Breed1,Gender,Color1,Color2,MaturitySize,FurLength,Vaccinated,Sterilized,Health,Fee,Description,PhotoAmt,target
0,Cat,3,Tabby,Male,Black,White,Small,Short,No,No,Healthy,100,Nibble is a 3+ month old ball of cuteness. He ...,1,1
1,Cat,1,Domestic Medium Hair,Male,Black,Brown,Medium,Medium,Not Sure,Not Sure,Healthy,0,I just found it alone yesterday near my apartm...,2,1
2,Dog,1,Mixed Breed,Male,Brown,White,Medium,Medium,Yes,No,Healthy,0,Their pregnant mother was dumped by her irresp...,7,1
3,Dog,4,Mixed Breed,Female,Black,Brown,Medium,Short,Yes,No,Healthy,150,"Good guard dog, very alert, active, obedience ...",8,1
4,Dog,1,Mixed Breed,Male,Black,No Color,Medium,Short,No,No,Healthy,0,This handsome yet cute boy is up for adoption....,3,1


In [None]:
dataframe.columns

Index(['Type', 'Age', 'Breed1', 'Gender', 'Color1', 'Color2', 'MaturitySize',
       'FurLength', 'Vaccinated', 'Sterilized', 'Health', 'Fee', 'Description',
       'PhotoAmt', 'target'],
      dtype='object')

In [None]:
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11537 entries, 0 to 11536
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Type          11537 non-null  object
 1   Age           11537 non-null  int64 
 2   Breed1        11537 non-null  object
 3   Gender        11537 non-null  object
 4   Color1        11537 non-null  object
 5   Color2        11537 non-null  object
 6   MaturitySize  11537 non-null  object
 7   FurLength     11537 non-null  object
 8   Vaccinated    11537 non-null  object
 9   Sterilized    11537 non-null  object
 10  Health        11537 non-null  object
 11  Fee           11537 non-null  int64 
 12  Description   11528 non-null  object
 13  PhotoAmt      11537 non-null  int64 
 14  target        11537 non-null  int64 
dtypes: int64(4), object(11)
memory usage: 1.3+ MB


In [None]:
dataframe.target.value_counts()

target
1    8457
0    3080
Name: count, dtype: int64
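The counts above show roughly 2.7 times as many adopted pets (1) as non-adopted ones (0). This is the kind of imbalance the `fixImbalance` parameter is described as addressing by replicating examples of underrepresented categories. The pandas sketch below shows one way such replication could work. `oversample_minority` is a hypothetical helper, and random sampling with replacement is an assumption, not a description of the annotator's actual strategy.

```python
import pandas as pd

def oversample_minority(df, label_col, seed=0):
    """Replicate rows of smaller classes until every class matches the largest one."""
    counts = df[label_col].value_counts()
    target_size = counts.max()
    parts = []
    for label, count in counts.items():
        group = df[df[label_col] == label]
        if count < target_size:
            # Draw the missing rows from the same class, with replacement
            extra = group.sample(n=target_size - count, replace=True, random_state=seed)
            group = pd.concat([group, extra])
        parts.append(group)
    return pd.concat(parts, ignore_index=True)

# Toy data: six rows of class 1 and two rows of class 0
toy = pd.DataFrame({"feature": range(8), "target": [1, 1, 1, 1, 1, 1, 0, 0]})
balanced = oversample_minority(toy, "target")
print(balanced["target"].value_counts())
```

Replication like this should only ever be applied to the training portion, so that validation and test metrics still reflect the real class distribution.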

In [None]:
dataframe.Description = dataframe.Description.fillna('- no description -')

In [None]:
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# One-hot encode the categorical columns, turn the free-text description into
# 100 TF-IDF features, and standardize the remaining numeric columns
# (Age, Fee, PhotoAmt).
column_trans = make_column_transformer(
     (OneHotEncoder(), ['Type', 'Breed1', 'Gender', 'Color1', 'Color2', 'MaturitySize',
       'FurLength', 'Vaccinated', 'Sterilized', 'Health']),
     (TfidfVectorizer(max_features=100,  norm='l2', ngram_range=(1, 3)), 'Description'),
     remainder=StandardScaler())

X = column_trans.fit_transform(dataframe.drop(['target'], axis=1))

y = dataframe.target

In [None]:
y.nunique()

2

In [None]:
X.shape

(11537, 302)

In [None]:
input_dim = X.shape[1]

In [None]:
input_dim

302

In [None]:
df = pd.DataFrame.sparse.from_spmatrix(X)

df.columns = ['col_{}'.format(i) for i in range(input_dim)]

df['target']= y

df.head()

Unnamed: 0,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7,col_8,col_9,...,col_293,col_294,col_295,col_296,col_297,col_298,col_299,col_300,col_301,target
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,-0.452479,0.950288,-0.829762,1
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,-0.555981,-0.299388,-0.511871,1
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,-0.555981,-0.299388,1.077583,1
3,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.22648,0.0,-0.400729,1.575125,1.395473,1
4,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.198113,0.0,0.133816,0.181835,0.0,-0.555981,-0.299388,-0.19398,1


## Building a pipeline

The `FeaturesAssembler` collects features from different columns. It accepts single-value columns (anything that can be cast to a float; if the cast fails, the value is set to 0), array columns, and Spark NLP annotations (for an embedding annotation it takes the embedding; otherwise it tries to cast the `result` field). Its output is a `FEATURE_VECTOR` annotation, with the numeric vector stored in the `embeddings` field.

The GenericClassifierApproach takes FEATURE_VECTOR annotations as input, classifies them and outputs CATEGORY annotations.
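The casting rule described above for single-value and array columns can be pictured with a small pure-Python sketch. `assemble_features` and `to_float_or_zero` are hypothetical helpers written for illustration. They are not the `FeaturesAssembler` implementation and do not cover Spark NLP annotations.

```python
def to_float_or_zero(value):
    """Cast a value to float, falling back to 0.0 when the cast fails."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return 0.0

def assemble_features(row, input_cols):
    """Flatten the selected columns of one row into a single numeric vector."""
    vector = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, (list, tuple)):
            # Array column: append every element
            vector.extend(to_float_or_zero(v) for v in value)
        else:
            # Single-value column: append one number
            vector.append(to_float_or_zero(value))
    return vector

row = {"age": "3", "fee": 100, "breed": "Tabby", "scores": [0.5, 0.25]}
features = assemble_features(row, ["age", "fee", "breed", "scores"])
print(features)  # [3.0, 100.0, 0.0, 0.5, 0.25]
```

The string `"Tabby"` cannot be cast to a float, so it contributes 0. This is why categorical columns in this notebook are one-hot encoded before they are assembled.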

In [None]:
spark_df = spark.createDataFrame(df)
spark_df.select(spark_df.columns[-10:]).show(2)

+-------+-------+-------+-------+-------+-------+-------------------+--------------------+-------------------+------+
|col_293|col_294|col_295|col_296|col_297|col_298|            col_299|             col_300|            col_301|target|
+-------+-------+-------+-------+-------+-------+-------------------+--------------------+-------------------+------+
|    0.0|    0.0|    0.0|    0.0|    0.0|    0.0|-0.4524794726808656|  0.9502875792756131|-0.8297616989552165|     1|
|    0.0|    0.0|    0.0|    0.0|    0.0|    0.0| -0.555981017719065|-0.29938816657135553|-0.5118709929431844|     1|
+-------+-------+-------+-------+-------+-------+-------------------+--------------------+-------------------+------+
only showing top 2 rows



In [None]:
import pyspark.sql.functions as F
spark_df.groupBy("target") \
    .count() \
    .orderBy(F.col("count").desc()) \
    .show()

+------+-----+
|target|count|
+------+-----+
|     1| 8457|
|     0| 3080|
+------+-----+



In [None]:
(training_data, test_data) = spark_df.randomSplit([0.8, 0.2], seed = 100)

print("Training Dataset Count: " + str(training_data.count()))
print("Test Dataset Count: " + str(test_data.count()))

Training Dataset Count: 9234
Test Dataset Count: 2303


In [None]:
!pip install -q tensorflow tensorflow_addons

In [None]:
graph_folder = "gc_graph"

gc_graph_builder = medical.TFGraphBuilder()\
    .setModelName("generic_classifier")\
    .setInputCols(["features"])\
    .setLabelColumn("target")\
    .setHiddenLayers([300, 200, 100])\
    .setHiddenAct("tanh")\
    .setHiddenActL2(True)\
    .setBatchNorm(True)\
    .setGraphFolder(graph_folder)\
    .setGraphFile("gcf_graph.pb")

In [None]:
# -p avoids an error if the folder already exists when the cell is re-run
!mkdir -p logs

features_asm = medical.FeaturesAssembler()\
    .setInputCols(['col_{}'.format(i) for i in range(X.shape[1])])\
    .setOutputCol("features")

gen_clf = medical.GenericClassifierApproach()\
    .setLabelColumn("target")\
    .setInputCols("features")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/gcf_graph.pb")\
    .setEpochsNumber(50)\
    .setBatchSize(64)\
    .setFeatureScaling("zscore")\
    .setFixImbalance(True)\
    .setLearningRate(0.002)\
    .setOutputLogsPath("logs")\
    .setValidationSplit(0.2) # keep 20% of the data for validation purposes

clf_Pipeline = nlp.Pipeline(stages=[
    features_asm,
    gc_graph_builder,
    gen_clf])


In [None]:
%%time

# train 50 epochs (takes around 1 min)

clf_model = clf_Pipeline.fit(training_data)

TF Graph Builder configuration:
Model name: generic_classifier
Graph folder: gc_graph
Graph file name: gcf_graph.pb
Build params: {'input_dim': 302, 'output_dim': 2, 'hidden_layers': [300, 200, 100], 'hidden_act': 'tanh', 'hidden_act_l2': True, 'batch_norm': True}


Instructions for updating:
Colocations handled automatically by placer.


generic_classifier graph exported to gc_graph/gcf_graph.pb
CPU times: user 7.71 s, sys: 544 ms, total: 8.25 s
Wall time: 1min 27s


In [None]:
import os
log_file_name = os.listdir("logs")[0]

with open("logs/"+log_file_name, "r") as log_file :
    print(log_file.read())

Training 50 epochs
Epoch 1/50	1.26s	Loss: 29.72683	ACC: 0.63042784	Validation ACC: 0.6751489
Epoch 2/50	0.57s	Loss: 25.212645	ACC: 0.71279335	Validation ACC: 0.67081755
Epoch 3/50	0.49s	Loss: 22.523352	ACC: 0.7565553	Validation ACC: 0.6897672
Epoch 4/50	0.49s	Loss: 19.700336	ACC: 0.7978378	Validation ACC: 0.6789388
Epoch 5/50	0.47s	Loss: 16.094986	ACC: 0.8396641	Validation ACC: 0.6962642
Epoch 6/50	0.53s	Loss: 13.144633	ACC: 0.87848717	Validation ACC: 0.7049269
Epoch 7/50	0.50s	Loss: 10.431298	ACC: 0.9068736	Validation ACC: 0.7173795
Epoch 8/50	0.46s	Loss: 7.350108	ACC: 0.93866235	Validation ACC: 0.7222523
Epoch 9/50	0.46s	Loss: 5.348463	ACC: 0.95652735	Validation ACC: 0.7103411
Epoch 10/50	0.60s	Loss: 4.086995	ACC: 0.96883476	Validation ACC: 0.7249594
Epoch 11/50	0.47s	Loss: 3.014661	ACC: 0.97952586	Validation ACC: 0.71792096
Epoch 12/50	0.61s	Loss: 2.4270387	ACC: 0.9829781	Validation ACC: 0.72171086
Epoch 13/50	0.50s	Loss: 1.536094	ACC: 0.99043643	Validation ACC: 0.724418
Epoch 14/50
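In the log above, training accuracy reaches about 0.99 by epoch 13 while validation accuracy stays near 0.72, a gap that usually points to overfitting. The sketch below parses log lines into records so the best validation epoch can be found programmatically. The regular expression is based only on the line format shown in this output, so other versions may log differently. `parse_training_log` is a hypothetical helper.

```python
import re

# Matches lines such as:
# "Epoch 10/50<TAB>0.60s<TAB>Loss: 4.086995<TAB>ACC: 0.96883476<TAB>Validation ACC: 0.7249594"
LINE_RE = re.compile(
    r"Epoch (\d+)/\d+\s+[\d.]+s\s+Loss: ([\d.]+)\s+ACC: ([\d.]+)\s+Validation ACC: ([\d.]+)"
)

def parse_training_log(text):
    """Return one dict per complete epoch line found in the log text."""
    rows = []
    for match in LINE_RE.finditer(text):
        epoch, loss, acc, val_acc = match.groups()
        rows.append({
            "epoch": int(epoch),
            "loss": float(loss),
            "acc": float(acc),
            "val_acc": float(val_acc),
        })
    return rows

# Three lines copied from the training log above
sample_log = (
    "Training 50 epochs\n"
    "Epoch 8/50\t0.46s\tLoss: 7.350108\tACC: 0.93866235\tValidation ACC: 0.7222523\n"
    "Epoch 10/50\t0.60s\tLoss: 4.086995\tACC: 0.96883476\tValidation ACC: 0.7249594\n"
    "Epoch 13/50\t0.50s\tLoss: 1.536094\tACC: 0.99043643\tValidation ACC: 0.724418\n"
)
rows = parse_training_log(sample_log)
best = max(rows, key=lambda r: r["val_acc"])
gap = best["acc"] - best["val_acc"]
print(f"best validation epoch: {best['epoch']}, train/validation gap: {gap:.3f}")
```

A large gap at the best validation epoch suggests trying fewer epochs, a higher `dropout`, or a smaller network before training longer.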

In [None]:
pred_df = clf_model.transform(test_data)

In [None]:
pred_df.select('target','prediction.result').show()

+------+------+
|target|result|
+------+------+
|     1|   [1]|
|     1|   [1]|
|     0|   [0]|
|     1|   [1]|
|     0|   [1]|
|     1|   [1]|
|     0|   [1]|
|     1|   [1]|
|     1|   [1]|
|     1|   [1]|
|     0|   [1]|
|     1|   [0]|
|     1|   [1]|
|     1|   [1]|
|     1|   [1]|
|     1|   [1]|
|     1|   [1]|
|     1|   [1]|
|     1|   [1]|
|     1|   [1]|
+------+------+
only showing top 20 rows



In [None]:
preds_df = pred_df.select('target','prediction.result').toPandas()

# Each prediction is a one-element array; take its first item and cast it to int
preds_df['result'] = preds_df['result'].apply(lambda x : int(x[0]))


In [None]:
preds_df.sample(10)

Unnamed: 0,target,result
1223,1,1
1207,1,1
866,1,1
1123,1,0
1926,1,1
156,1,0
680,1,1
1130,1,1
1918,0,0
1853,0,0


In [None]:
# Use sklearn to evaluate the results on the test dataset
from sklearn.metrics import classification_report, accuracy_score

print(classification_report(preds_df['target'], preds_df['result'], digits=4))

print(accuracy_score(preds_df['target'], preds_df['result']))


              precision    recall  f1-score   support

           0     0.5218    0.4541    0.4856       632
           1     0.8032    0.8426    0.8224      1671

    accuracy                         0.7360      2303
   macro avg     0.6625    0.6484    0.6540      2303
weighted avg     0.7260    0.7360    0.7300      2303

0.7359965262700825
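To read the summary rows of the report, note that `macro avg` is the plain mean of the per-class scores and `weighted avg` weights each class by its support. The sketch below recomputes both from the rounded per-class values printed above, so small rounding differences from sklearn's unrounded computation are possible. `summarize` is a hypothetical helper.

```python
def summarize(per_class):
    """Return (macro average, support-weighted average) for (score, support) pairs."""
    total_support = sum(support for _, support in per_class)
    macro = sum(score for score, _ in per_class) / len(per_class)
    weighted = sum(score * support for score, support in per_class) / total_support
    return macro, weighted

# (score, support) for class 0 and class 1, copied from the report above
precision = [(0.5218, 632), (0.8032, 1671)]
f1 = [(0.4856, 632), (0.8224, 1671)]

macro_p, weighted_p = summarize(precision)
macro_f1, weighted_f1 = summarize(f1)

# Accuracy of always predicting the majority class (1) on this test set
majority_baseline = 1671 / 2303

print(f"precision: macro={macro_p:.4f} weighted={weighted_p:.4f}")
print(f"f1:        macro={macro_f1:.4f} weighted={weighted_f1:.4f}")
print(f"majority-class baseline accuracy: {majority_baseline:.4f}")
```

The recomputed averages match the `macro avg` and `weighted avg` rows of the report. The majority-class baseline of about 0.7256 is also worth keeping in mind: the reported accuracy of 0.7360 is only slightly above it, so the per-class scores for class 0 are the more informative numbers here.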
