![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# **FeaturesAssembler**

This notebook covers the different usages of the `FeaturesAssembler` annotator.

**📖 Learning Objectives:**

1. Understand how to use `FeaturesAssembler`.

2. Use `FeaturesAssembler` with several columns.

3. Use `FeaturesAssembler` with embeddings.


**🔗 Helpful Links:**

- Documentation : [FeaturesAssembler](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#featuresassembler)

- Python Docs : [FeaturesAssembler](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/feature_assembler/index.html#module-sparknlp_jsl.annotator.feature_assembler)

- Scala Docs : [FeaturesAssembler](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/FeaturesAssembler.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/08.2.Generic_Classifier.ipynb).

## **📜 Background**


The `FeaturesAssembler` collects features from different columns. It can read single-value columns (anything that can be cast to a float; if the cast fails, the value is set to 0), array columns, or Spark NLP annotations (if the annotation is an embedding, it takes the embedding; otherwise it tries to cast the `result` field). The output of the transformer is a `feature_vector` annotation, with the numeric vector stored in its `embeddings` field.
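
The casting rules described above can be pictured with a small plain-Python sketch. It only mimics the described behavior (single values cast to float with a fallback of 0, array values appended element by element) and is not the library's implementation; the helper names `to_float` and `assemble` and the sample row are hypothetical.

```python
def to_float(value):
    """Cast a single value to float, falling back to 0.0 when the cast fails."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return 0.0


def assemble(row, input_cols):
    """Concatenate the values of `input_cols` into one flat numeric vector."""
    vector = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, (list, tuple)):
            # Array column: every element contributes one feature.
            vector.extend(to_float(v) for v in value)
        else:
            # Single-value column: contributes exactly one feature.
            vector.append(to_float(value))
    return vector


# Hypothetical row: "breed" is a string that cannot be cast, so it becomes 0.0.
row = {"age": "3", "fee": 100, "breed": "Tabby", "scores": [0.5, 1]}
print(assemble(row, ["age", "fee", "breed", "scores"]))  # [3.0, 100.0, 0.0, 0.5, 1.0]
```

Note how the uncastable string silently turns into `0.0`, which is why string columns are usually encoded numerically before they reach the assembler.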

The output of `FeaturesAssembler` (`feature_vector`) is used as input for the **`GenericClassifier`**, **`GenericSVMClassifier`**, and **`GenericLogRegClassifier`** annotators.

## **🎬 Colab Setup**

In [1]:
!pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp

nlp.install()

In [None]:
from johnsnowlabs import nlp, medical
import pandas as pd
import numpy as np

spark = nlp.start()

## **🖨️ Input/Output Annotation Types**

- Input: `NONE`

- Output: `FEATURE_VECTOR`

## **🔎 Parameters**


The parameters below are used for `FeaturesAssembler`.


- `inputCols`: The names of the columns containing the input features. It accepts either a single column name (string) or an array of column names.
- `outputCol`: The name of the generated output column, which holds the `FEATURE_VECTOR` annotation. Only one column can be specified here.


All the parameters can be set using the corresponding set method in camel case. For example, `.setInputCols()`.

## Using `FeaturesAssembler` with several columns

Let's load a dataset and create some float-valued features to be used in `FeaturesAssembler`.

In [5]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/petfinder-mini.csv

In [6]:
# prepare the data
dataframe = pd.read_csv('petfinder-mini.csv')
dataframe['target'] = np.where(dataframe['AdoptionSpeed']==4, 0, 1) # In the original dataset "4" indicates the pet was not adopted.
dataframe.Description = dataframe.Description.fillna('- no description -')
dataframe.head(3)

Unnamed: 0,Type,Age,Breed1,Gender,Color1,Color2,MaturitySize,FurLength,Vaccinated,Sterilized,Health,Fee,Description,PhotoAmt,AdoptionSpeed,target
0,Cat,3,Tabby,Male,Black,White,Small,Short,No,No,Healthy,100,Nibble is a 3+ month old ball of cuteness. He ...,1,2,1
1,Cat,1,Domestic Medium Hair,Male,Black,Brown,Medium,Medium,Not Sure,Not Sure,Healthy,0,I just found it alone yesterday near my apartm...,2,0,1
2,Dog,1,Mixed Breed,Male,Brown,White,Medium,Medium,Yes,No,Healthy,0,Their pregnant mother was dumped by her irresp...,7,3,1


To create numerical features, we will use `OneHotEncoder` and TF-IDF, because `FeaturesAssembler` cannot cast string values to float (a failed cast is set to 0).
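
Before running the actual encoders, here is a minimal plain-Python sketch of what one-hot encoding does to a string column. It is only an illustration of the idea, not what scikit-learn's `OneHotEncoder` does internally; the helper `one_hot` and the sample values are made up for this example.

```python
def one_hot(values):
    """Map each string to a 0/1 vector with one slot per distinct category."""
    categories = sorted(set(values))              # fixed, reproducible column order
    index = {cat: i for i, cat in enumerate(categories)}
    encoded = []
    for value in values:
        vec = [0.0] * len(categories)
        vec[index[value]] = 1.0                   # mark the matching category
        encoded.append(vec)
    return categories, encoded


categories, encoded = one_hot(["Cat", "Dog", "Cat"])
print(categories)  # ['Cat', 'Dog']
print(encoded)     # [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
```

Each distinct category becomes its own float column, which is the kind of input `FeaturesAssembler` can consume directly.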

In [7]:
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

column_trans = make_column_transformer(
    # one-hot encode the categorical string columns
    (OneHotEncoder(), ['Type', 'Breed1', 'Gender', 'Color1', 'Color2', 'MaturitySize',
                       'FurLength', 'Vaccinated', 'Sterilized', 'Health']),
    # limit the TF-IDF vocabulary to the 15 most frequent 1- to 3-grams of the free text
    (TfidfVectorizer(max_features=15, norm='l2', ngram_range=(1, 3)), 'Description'),
    # standardize the remaining (numeric) columns
    remainder=StandardScaler())

# fit on every column except the label and transform to a numeric matrix
X = column_trans.fit_transform(dataframe.drop(['target'], axis=1))

y = dataframe.target
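
The `remainder=StandardScaler()` argument in the cell above standardizes the numeric columns that are not handled by the other two transformers. As a rough sketch of that z-score step in plain Python (the helper `standardize` is hypothetical; it uses the population standard deviation and leaves constant columns unscaled, which matches how `StandardScaler` is documented to behave):

```python
import math


def standardize(values):
    """Return z-scores: subtract the mean, divide by the population std."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance) or 1.0   # constant column: avoid dividing by zero
    return [(v - mean) / std for v in values]


# Hypothetical ages of three pets, in months.
print(standardize([1.0, 2.0, 3.0]))
```

After this step every numeric column has zero mean and unit variance, so columns with large raw ranges do not dominate the assembled feature vector.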

In [8]:
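# wrap the sparse feature matrix in a pandas DataFrame
df = pd.DataFrame.sparse.from_spmatrix(X)

# give every feature column a generic name: col_0, col_1, ...
feature_columns = ['col_{}'.format(i) for i in range(X.shape[1])]

df.columns = feature_columns

# append the label column
df['target'] = y

df.head(3)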

Unnamed: 0,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7,col_8,col_9,...,col_209,col_210,col_211,col_212,col_213,col_214,col_215,col_216,col_217,target
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.381339,0.29967,0.0,0.0,0.0,-0.452479,0.950288,-0.829762,-0.414688,1
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.826697,0.0,0.0,0.0,-0.555981,-0.299388,-0.511871,-2.119392,1
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.366811,0.720633,0.0,0.0,0.0,-0.555981,-0.299388,1.077583,0.437664,1


In [9]:
spark_df = spark.createDataFrame(df)
spark_df.select(spark_df.columns[-10:]).show(2)

+-------------------+------------------+-------+-------+-------+-------------------+--------------------+-------------------+--------------------+------+
|            col_209|           col_210|col_211|col_212|col_213|            col_214|             col_215|            col_216|             col_217|target|
+-------------------+------------------+-------+-------+-------+-------------------+--------------------+-------------------+--------------------+------+
|0.38133910553960104|0.2996700768545381|    0.0|    0.0|    0.0|-0.4524794726808656|  0.9502875792756131|-0.8297616989552165|-0.41468778162984526|     1|
|                0.0|0.8266974703081306|    0.0|    0.0|    0.0| -0.555981017719065|-0.29938816657135553|-0.5118709929431844| -2.1193921951924772|     1|
+-------------------+------------------+-------+-------+-------+-------------------+--------------------+-------------------+--------------------+------+
only showing top 2 rows



In [10]:
# display the feature column names
print(feature_columns)

['col_0', 'col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6', 'col_7', 'col_8', 'col_9', 'col_10', 'col_11', 'col_12', 'col_13', 'col_14', 'col_15', 'col_16', 'col_17', 'col_18', 'col_19', 'col_20', 'col_21', 'col_22', 'col_23', 'col_24', 'col_25', 'col_26', 'col_27', 'col_28', 'col_29', 'col_30', 'col_31', 'col_32', 'col_33', 'col_34', 'col_35', 'col_36', 'col_37', 'col_38', 'col_39', 'col_40', 'col_41', 'col_42', 'col_43', 'col_44', 'col_45', 'col_46', 'col_47', 'col_48', 'col_49', 'col_50', 'col_51', 'col_52', 'col_53', 'col_54', 'col_55', 'col_56', 'col_57', 'col_58', 'col_59', 'col_60', 'col_61', 'col_62', 'col_63', 'col_64', 'col_65', 'col_66', 'col_67', 'col_68', 'col_69', 'col_70', 'col_71', 'col_72', 'col_73', 'col_74', 'col_75', 'col_76', 'col_77', 'col_78', 'col_79', 'col_80', 'col_81', 'col_82', 'col_83', 'col_84', 'col_85', 'col_86', 'col_87', 'col_88', 'col_89', 'col_90', 'col_91', 'col_92', 'col_93', 'col_94', 'col_95', 'col_96', 'col_97', 'col_98', 'col_99', 'col_100'

`FeaturesAssembler` combines all of the `feature_columns` above into a single column of type `feature_vector`.

In [11]:
features_asm = medical.FeaturesAssembler()\
    .setInputCols(feature_columns)\
    .setOutputCol("features")

In [12]:
result = features_asm.transform(spark_df)

In [13]:
result.show(2)


In [14]:
result.select("features").show(10, truncate=False)


The generated `features` column can now be used as an input column for the `GenericClassifier`, `GenericSVMClassifier`, and `GenericLogRegClassifier` annotators.

## Using `FeaturesAssembler` with embeddings




Load some data to extract embeddings.

In [15]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/mtsamples_classifier.csv

In [16]:
spark_df = spark.read.csv("mtsamples_classifier.csv", header=True)

spark_df.show(10, truncate=100)

+----------------+----------------------------------------------------------------------------------------------------+
|        category|                                                                                                text|
+----------------+----------------------------------------------------------------------------------------------------+
|Gastroenterology| PROCEDURES PERFORMED: Colonoscopy. INDICATIONS: Renewed symptoms likely consistent with active f...|
|Gastroenterology| OPERATION 1. Ivor-Lewis esophagogastrectomy. 2. Feeding jejunostomy. 3. Placement of two right-s...|
|Gastroenterology| PREOPERATIVE DIAGNOSES: 1. Gastroesophageal reflux disease. 2. Chronic dyspepsia. POSTOPERATIVE ...|
|Gastroenterology| PROCEDURE: Colonoscopy. PREOPERATIVE DIAGNOSES: Rectal bleeding and perirectal abscess. POSTOPER...|
|Gastroenterology| PREOPERATIVE DIAGNOSIS: Right colon tumor. POSTOPERATIVE DIAGNOSES: 1. Right colon cancer. 2. As...|
|Gastroenterology| PREOPERATIVE DIAGNOSI

Now let's create a pipeline that extracts sentence embeddings and creates a `feature_vector` at the end using `FeaturesAssembler`.

In [17]:
document_assembler = nlp.DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
            .setInputCols(["document"]) \
            .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_healthcare_100d","en","clinical/models")\
        .setInputCols(["document","token"])\
        .setOutputCol("word_embeddings")

sentence_embeddings = nlp.SentenceEmbeddings() \
        .setInputCols(["document", "word_embeddings"]) \
        .setOutputCol("sentence_embeddings") \
        .setPoolingStrategy("AVERAGE")

features_asm = medical.FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("features")

embeddings_pipeline = nlp.Pipeline(
    stages = [
        document_assembler,
        tokenizer,
        word_embeddings,
        sentence_embeddings,
        features_asm
    ])

embeddings_healthcare_100d download started this may take some time.
Approximate size to download 475.8 MB
[OK!]
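
The `AVERAGE` pooling strategy set on `sentence_embeddings` above can be pictured as an element-wise mean over the word vectors of a document. The following plain-Python sketch is only an illustration under that assumption, not the annotator's actual code; `average_pool` and the tiny 2-dimensional vectors are made up for the example (the real model produces 100-dimensional vectors).

```python
def average_pool(token_vectors):
    """Average token vectors element-wise into a single vector."""
    if not token_vectors:
        return []
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


# Three hypothetical token embeddings of dimension 2.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(average_pool(tokens))  # [3.0, 4.0]
```

The pooled vector has the same dimension as each word vector, so every document yields one fixed-length vector regardless of how many tokens it contains.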


In [18]:
result = embeddings_pipeline.fit(spark_df).transform(spark_df)

In [19]:
result.show(2)

+----------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|        category|                text|            document|               token|     word_embeddings| sentence_embeddings|            features|
+----------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|Gastroenterology| PROCEDURES PERFO...|[{document, 0, 30...|[{token, 1, 10, P...|[{word_embeddings...|[{sentence_embedd...|[{feature_vector,...|
|Gastroenterology| OPERATION 1. Ivo...|[{document, 0, 59...|[{token, 1, 9, OP...|[{word_embeddings...|[{sentence_embedd...|[{feature_vector,...|
+----------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 2 rows



In [20]:
result.select("sentence_embeddings.embeddings").show(5, truncate=False)


In [21]:
result.select("features").show(5, truncate=False)


The `features` column is now generated from the sentence embeddings and can be used as an input column for the `GenericClassifier`, `GenericSVMClassifier`, and `GenericLogRegClassifier` annotators.