![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# **GenericLogRegClassifierModel**

This notebook covers the parameters and usage of `GenericLogRegClassifierModel`. This annotator is a derivative of `GenericClassifier` that implements multinomial logistic regression.

**📖 Learning Objectives:**

1. Understand how the annotator works: it is a single-layer neural network with the logistic function at the output. The input to the model is a `FEATURE_VECTOR`, and the output is `CATEGORY` annotations with labels and corresponding confidence scores between 0 and 1. Training data requires only the "text" and "label" columns, and the trained model is a `GenericLogRegClassifierModel`.

2. Become comfortable using the different parameters of the annotator.
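
To make the first objective concrete, here is a minimal pure-Python sketch of a single-layer classifier with a softmax (multinomial logistic) output. It only illustrates the general technique and is not the annotator's actual implementation. The names `softmax` and `predict`, the labels, and the weights are hypothetical, hand-picked values rather than anything taken from a trained model.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(feature_vector, weights, biases, labels):
    """Single-layer forward pass: one weight row and one bias per class."""
    logits = [
        sum(w * x for w, x in zip(row, feature_vector)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs

# Toy values (hypothetical, not taken from a trained model)
labels = ["negative", "neutral", "positive"]
weights = [[1.0, -1.0], [0.0, 0.0], [-1.0, 1.0]]
biases = [0.0, 0.0, 0.0]

label, probs = predict([0.2, 1.5], weights, biases, labels)
print(label, [round(p, 3) for p in probs])
```

Each probability lies between 0 and 1, which matches the confidence scores the annotator is described as attaching to every predicted label.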


**🔗 Helpful Links:**


- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings/Healthcare).

## **🎬 Colab Setup**

In [None]:
!pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp

nlp.install()

In [4]:
from johnsnowlabs import nlp, medical
import pyspark.sql.functions as F
import pandas as pd

spark = nlp.start()

👌 Detected license file /content/4.4.4.spark_nlp_for_healthcare.json
👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False if you want to install outdated secrets.
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.0.0, 💊Spark-Healthcare==5.0.0, running on ⚡ PySpark==3.1.2


## **🖨️ Input/Output Annotation Types**

- Input: `FEATURE_VECTOR`

- Output: `CATEGORY`

## **🔎 Parameters**


- `LabelColumn`: This parameter sets the name of the column in your input data that contains the labels (categories) for the classification task. The classifier will use this column to learn from the data and make predictions.

- `ModelFile`: This parameter specifies the path to the model file for the logistic regression classifier. It should be a protobuf (`.pb`) file containing the model graph. In this notebook, that file is generated by `TFGraphBuilder` before training starts.

- `EpochsNumber`: This parameter sets the number of epochs (iterations) the classifier will go through during the training process. An epoch represents one complete pass through the entire training dataset.

- `BatchSize`: This parameter sets the batch size used during training. The training data is divided into batches, and the model's weights are updated after processing each batch. A larger batch size may speed up training, but it requires more memory.

- `LearningRate`: This parameter sets the learning rate for the optimization algorithm used during training. The learning rate determines how much the model's weights are updated based on the computed gradients. A higher learning rate may lead to faster convergence but risks overshooting the optimal solution.

- `OutputLogsPath`: This parameter specifies the path where the logs related to the training process will be stored. These logs can include information such as training loss, accuracy, and other metrics.

- `Dropout`: Dropout is a regularization technique used to prevent overfitting in neural networks. This parameter sets the dropout rate, which determines the probability that each neuron's output will be temporarily ignored during training.

- `FixImbalance`: Imbalance refers to the situation when some classes have significantly more training examples than others. Setting this parameter to True indicates that the classifier will handle class imbalance during training to help ensure that the model doesn't become biased towards the majority class.

- `ValidationSplit`: This parameter sets the fraction of the training data, a value between 0 and 1, that is held out for validation during training. The remaining data is used for the actual training. It is commented out in the example in this notebook, so no validation split is applied there.
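
As a rough illustration of how `EpochsNumber`, `BatchSize`, and `ValidationSplit` interact, the sketch below counts weight updates under the assumption of one update per batch, as described above. The helper `training_steps` and the dataset size of 10,000 rows are hypothetical. The other values mirror the ones used in this notebook, and the exact way the library rounds the validation split may differ.

```python
import math

def training_steps(num_rows, epochs, batch_size, validation_split=0.0):
    """Return (train_rows, batches_per_epoch, total_updates)."""
    # Rows held out for validation are not used for weight updates
    train_rows = num_rows - int(num_rows * validation_split)
    # The last batch of an epoch may be smaller than batch_size
    batches_per_epoch = math.ceil(train_rows / batch_size)
    return train_rows, batches_per_epoch, batches_per_epoch * epochs

# 10_000 rows is a hypothetical dataset size
train_rows, batches, updates = training_steps(
    10_000, epochs=20, batch_size=128, validation_split=0.1
)
print(train_rows, batches, updates)
```

Doubling the batch size roughly halves the number of updates per epoch, which is one reason batch size and learning rate are usually tuned together.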

In [None]:
graph_folder = "gc_graph"

gc_logreg_graph_builder = medical.TFGraphBuilder()\
    .setModelName("logreg_classifier")\
    .setInputCols(["feature_vector"]) \
    .setLabelColumn("category")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("log_reg_graph.pb")

In [None]:
features_asm = medical.FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("feature_vector")

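# Assumption: `log_folder` is not defined earlier in this notebook;
# "logs" is a placeholder path, so adjust it to your environment.
log_folder = "logs"

# Parentheses (instead of trailing backslashes) let the optional
# setter stay commented out without breaking the call chain.
gen_clf = (
    medical.GenericLogRegClassifierApproach()
    .setLabelColumn("category")
    .setInputCols("feature_vector")
    .setOutputCol("prediction")
    .setModelFile(f"{graph_folder}/log_reg_graph.pb")
    .setEpochsNumber(20)
    .setBatchSize(128)
    .setLearningRate(0.01)
    .setOutputLogsPath(log_folder)
    .setDropout(0.1)
    .setFixImbalance(True)
    # .setValidationSplit(0.1)
)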

clf_Pipeline = nlp.Pipeline(stages=[
    features_asm,
    gc_logreg_graph_builder,
    gen_clf])