Features
---
MLlib Main Guide: https://spark.apache.org/docs/2.4.3/ml-features.html

This module contains algorithms for working with features, roughly divided into these groups:

- Extraction: Extracting features from “raw” data
- Transformation: Scaling, converting, or modifying features
- Selection: Selecting a subset from a larger set of features
- Locality Sensitive Hashing (LSH): This class of algorithms combines aspects of feature transformation with other algorithms.

## pyspark.ml.feature
Class structure: https://spark.apache.org/docs/2.4.3/api/python/pyspark.ml.html#module-pyspark.ml.feature  
GitHub: https://github.com/apache/spark/blob/v2.4.3/python/pyspark/ml/feature.py

### [Feature Extractors](https://spark.apache.org/docs/2.4.3/ml-features.html#feature-extractors)

- [HashingTF](https://spark.apache.org/docs/2.4.3/ml-features.html#tf-idf)
- [IDF & IDFModel](https://spark.apache.org/docs/2.4.3/ml-features.html#tf-idf)
- [Word2Vec & Word2VecModel](https://spark.apache.org/docs/2.4.3/ml-features.html#word2vec)
- [CountVectorizer & CountVectorizerModel](https://spark.apache.org/docs/2.4.3/ml-features.html#countvectorizer)
- [FeatureHasher](https://spark.apache.org/docs/2.4.3/ml-features.html#featurehasher)


### [Feature Transformers](https://spark.apache.org/docs/2.4.3/ml-features.html#feature-transformers)

- [Tokenizer](https://spark.apache.org/docs/2.4.3/ml-features.html#tokenizer)
- [RegexTokenizer](https://spark.apache.org/docs/2.4.3/ml-features.html#tokenizer)
- [StopWordsRemover](https://spark.apache.org/docs/2.4.3/ml-features.html#stopwordsremover)
- [NGram](https://spark.apache.org/docs/2.4.3/ml-features.html#n-gram)
- [Binarizer](https://spark.apache.org/docs/2.4.3/ml-features.html#binarizer)
- [PCA & PCAModel](https://spark.apache.org/docs/2.4.3/ml-features.html#pca)
- [PolynomialExpansion](https://spark.apache.org/docs/2.4.3/ml-features.html#polynomialexpansion)
- [DCT](https://spark.apache.org/docs/2.4.3/ml-features.html#discrete-cosine-transform-dct)
- [StringIndexer & StringIndexerModel](https://spark.apache.org/docs/2.4.3/ml-features.html#stringindexer)
- [IndexToString](https://spark.apache.org/docs/2.4.3/ml-features.html#indextostring)
- [OneHotEncoder (deprecated since 2.3.0)](https://spark.apache.org/docs/2.4.3/ml-features.html#onehotencoder-deprecated-since-230)
- [OneHotEncoderEstimator & OneHotEncoderModel](https://spark.apache.org/docs/2.4.3/ml-features.html#onehotencoderestimator)
- [VectorIndexer & VectorIndexerModel](https://spark.apache.org/docs/2.4.3/ml-features.html#vectorindexer)
- [Normalizer](https://spark.apache.org/docs/2.4.3/ml-features.html#normalizer)
- [StandardScaler & StandardScalerModel](https://spark.apache.org/docs/2.4.3/ml-features.html#standardscaler)
- [MinMaxScaler & MinMaxScalerModel](https://spark.apache.org/docs/2.4.3/ml-features.html#minmaxscaler)
- [MaxAbsScaler & MaxAbsScalerModel](https://spark.apache.org/docs/2.4.3/ml-features.html#maxabsscaler)
- [Bucketizer](https://spark.apache.org/docs/2.4.3/ml-features.html#bucketizer)
- [ElementwiseProduct](https://spark.apache.org/docs/2.4.3/ml-features.html#elementwiseproduct)
- [SQLTransformer](https://spark.apache.org/docs/2.4.3/ml-features.html#sqltransformer)
- [VectorAssembler](https://spark.apache.org/docs/2.4.3/ml-features.html#vectorassembler)
- [VectorSizeHint](https://spark.apache.org/docs/2.4.3/ml-features.html#vectorsizehint)
- [QuantileDiscretizer](https://spark.apache.org/docs/2.4.3/ml-features.html#quantilediscretizer)
- [Imputer & ImputerModel](https://spark.apache.org/docs/2.4.3/ml-features.html#imputer)

Not available in Python:
- [Interaction](https://spark.apache.org/docs/2.4.3/ml-features.html#interaction)

### [Feature Selectors](https://spark.apache.org/docs/2.4.3/ml-features.html#feature-selectors)

- [VectorSlicer](https://spark.apache.org/docs/2.4.3/ml-features.html#vectorslicer)
- [RFormula & RFormulaModel](https://spark.apache.org/docs/2.4.3/ml-features.html#rformula)
- [ChiSqSelector & ChiSqSelectorModel](https://spark.apache.org/docs/2.4.3/ml-features.html#chisqselector)


### [Locality Sensitive Hashing (LSH)](https://spark.apache.org/docs/2.4.3/ml-features.html#locality-sensitive-hashing)

- [BucketedRandomProjectionLSH & BucketedRandomProjectionLSHModel](https://spark.apache.org/docs/2.4.3/ml-features.html#bucketed-random-projection-for-euclidean-distance)
- [MinHashLSH & MinHashLSHModel](https://spark.apache.org/docs/2.4.3/ml-features.html#minhash-for-jaccard-distance)

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

In [170]:
from pyspark.ml.feature import QuantileDiscretizer

data = [(0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2)]
df = spark.createDataFrame(data, ["id", "hour"])

# Split "hour" into 3 buckets whose boundaries are approximate quantiles of the data.
discretizer = QuantileDiscretizer(numBuckets=3, inputCol="hour", outputCol="result")

result = discretizer.fit(df).transform(df)
result.show()

+---+----+------+
| id|hour|result|
+---+----+------+
|  0|18.0|   2.0|
|  1|19.0|   2.0|
|  2| 8.0|   1.0|
|  3| 5.0|   1.0|
|  4| 2.2|   0.0|
+---+----+------+
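
As a rough illustration of what the fitted model does, the sketch below picks split points from exact rank-based quantiles and then assigns each value to a half-open bucket. It is a simplification under stated assumptions: Spark computes approximate quantiles, so its splits can differ on larger datasets, and the helper names `quantile_splits` and `assign_bucket` are hypothetical, not part of PySpark.

```python
import math
from bisect import bisect_right


def quantile_splits(values, num_buckets):
    """Return split points: -inf, the interior quantiles, +inf.

    Assumption: exact rank-based quantiles; Spark uses an approximate algorithm.
    """
    ordered = sorted(values)
    n = len(ordered)
    interior = []
    for i in range(1, num_buckets):
        rank = math.ceil(i / num_buckets * n)  # 1-based rank of the i-th quantile
        interior.append(ordered[max(rank, 1) - 1])
    # Drop duplicate splits so the boundaries stay strictly increasing.
    unique = sorted(set(interior))
    return [-math.inf] + unique + [math.inf]


def assign_bucket(value, splits):
    """Bucket i covers [splits[i], splits[i + 1]); the last bucket also includes +inf."""
    return float(min(bisect_right(splits, value) - 1, len(splits) - 2))


hours = [18.0, 19.0, 8.0, 5.0, 2.2]
splits = quantile_splits(hours, 3)
buckets = [assign_bucket(h, splits) for h in hours]
print(splits)
print(buckets)
```

On this small dataset the sketch yields the same bucket assignment as the `result` column above.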



In [171]:
from pyspark.ml.feature import Word2Vec

# Input data: Each row is a bag of words from a sentence or document.
documentDF = spark.createDataFrame(
    [
        ("Hi I heard about Spark".split(" "),),
        ("I wish Java could use case classes".split(" "),),
        ("Logistic regression models are neat".split(" "),),
    ],
    ["text"],
)

# Learn a mapping from words to Vectors.
word2Vec = Word2Vec(vectorSize=3, minCount=0, inputCol="text", outputCol="result")
model = word2Vec.fit(documentDF)

result = model.transform(documentDF)
for row in result.collect():
    text, vector = row
    print(f"Text: {text} => \nVector: {vector}\n")

Text: ['Hi', 'I', 'heard', 'about', 'Spark'] => 
Vector: [0.01781592555344105,0.02203895077109337,0.02248857170343399]

Text: ['I', 'wish', 'Java', 'could', 'use', 'case', 'classes'] => 
Vector: [-0.00022016598709991998,-0.0291545108027224,-0.01409794549856867]

Text: ['Logistic', 'regression', 'models', 'are', 'neat'] => 
Vector: [-0.03180776406079531,0.0591656094416976,0.005205358564853668]
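
The `result` column holds one vector per document, obtained by averaging the vectors of the words in that document. A minimal pure-Python sketch of that averaging step follows. The `word_vectors` table and the `document_vector` helper are made up for illustration and are not learned by Spark.

```python
def document_vector(words, word_vectors, size):
    """Average the word vectors of a document into a single vector.

    Assumptions for this sketch: words without a vector add nothing to the
    sum, the divisor is the total number of words, and an empty document
    maps to the zero vector.
    """
    total = [0.0] * size
    if not words:
        return total
    for word in words:
        vector = word_vectors.get(word)
        if vector is not None:
            total = [t + v for t, v in zip(total, vector)]
    return [t / len(words) for t in total]


# Hypothetical 3-dimensional word vectors.
word_vectors = {
    "spark": [1.0, 0.0, 2.0],
    "java": [3.0, 4.0, 0.0],
}
print(document_vector(["spark", "java"], word_vectors, 3))
```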



In [173]:
from pyspark.ml.feature import FeatureHasher

dataset = spark.createDataFrame(
    [
        (2.2, True, "1", "foo"),
        (3.3, False, "2", "bar"),
        (4.4, False, "3", "baz"),
        (5.5, False, "4", "foo"),
    ],
    ["real", "bool", "stringNum", "string"],
)

hasher = FeatureHasher(
    inputCols=["real", "bool", "stringNum", "string"], outputCol="features"
)

featurized = hasher.transform(dataset)
featurized.show(truncate=False)

+----+-----+---------+------+--------------------------------------------------------+
|real|bool |stringNum|string|features                                                |
+----+-----+---------+------+--------------------------------------------------------+
|2.2 |true |1        |foo   |(262144,[174475,247670,257907,262126],[2.2,1.0,1.0,1.0])|
|3.3 |false|2        |bar   |(262144,[70644,89673,173866,174475],[1.0,1.0,1.0,3.3])  |
|4.4 |false|3        |baz   |(262144,[22406,70644,174475,187923],[1.0,1.0,4.4,1.0])  |
|5.5 |false|4        |foo   |(262144,[70644,101499,174475,257907],[1.0,1.0,5.5,1.0]) |
+----+-----+---------+------+--------------------------------------------------------+
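
In the output above, the numeric column `real` lands at the same index (174475) in every row and keeps its value, while boolean and string columns contribute an indicator of 1.0 at an index that depends on the column and its value. The sketch below imitates that hashing trick in plain Python. It is an approximation under stated assumptions: `zlib.crc32` stands in for the MurmurHash3 function Spark uses, so the indices will not match Spark's, and `bucket` and `hash_row` are hypothetical helper names.

```python
import zlib

NUM_FEATURES = 262144  # 2**18, the vector size shown in the output above


def bucket(key, num_features=NUM_FEATURES):
    """Map a string key to a vector index.

    Assumption: zlib.crc32 replaces Spark's MurmurHash3, so the indices
    differ from the ones Spark prints.
    """
    return zlib.crc32(key.encode("utf-8")) % num_features


def hash_row(row, num_features=NUM_FEATURES):
    """Hash a dict of column -> value into a sparse {index: value} mapping."""
    features = {}
    for column, value in row.items():
        if isinstance(value, bool):
            # Booleans are treated like strings: "column=true" / "column=false".
            key, weight = f"{column}={str(value).lower()}", 1.0
        elif isinstance(value, (int, float)):
            # Numeric columns hash the column name and keep the value.
            key, weight = column, float(value)
        else:
            # String columns hash "column=value" with an indicator of 1.0.
            key, weight = f"{column}={value}", 1.0
        index = bucket(key, num_features)
        # Keys that collide add up in the same slot.
        features[index] = features.get(index, 0.0) + weight
    return features


row = {"real": 2.2, "bool": True, "stringNum": "1", "string": "foo"}
print(sorted(hash_row(row).items()))
```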



In [180]:
from pyspark.ml.feature import StringIndexer

df = spark.createDataFrame(
    [
        (0, "a"),
        (1, "b"),
        (2, "c"),
        (3, "a"),
        (4, "a"),
        (5, "c"),
        (6, "d"),
        (7, "a"),
        (8, "d"),
    ],
    ["id", "category"],
)

# Index labels by descending frequency: the most frequent label gets index 0.0.
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
indexed = indexer.fit(df).transform(df)
indexed.show()

+---+--------+-------------+
| id|category|categoryIndex|
+---+--------+-------------+
|  0|       a|          0.0|
|  1|       b|          3.0|
|  2|       c|          1.0|
|  3|       a|          0.0|
|  4|       a|          0.0|
|  5|       c|          1.0|
|  6|       d|          2.0|
|  7|       a|          0.0|
|  8|       d|          2.0|
+---+--------+-------------+
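
The index assignment above follows label frequency: `a` appears four times and gets 0.0, `b` appears once and gets the last index. A small pure-Python sketch of that ordering is shown below. The helper `fit_string_index` is hypothetical, and breaking ties alphabetically is an assumption of this sketch that happens to reproduce the table.

```python
from collections import Counter


def fit_string_index(labels):
    """Map each label to an index, most frequent label first.

    Assumption: ties are broken alphabetically to keep the sketch
    deterministic.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(index) for index, label in enumerate(ordered)}


categories = ["a", "b", "c", "a", "a", "c", "d", "a", "d"]
mapping = fit_string_index(categories)
print(mapping)
print([mapping[c] for c in categories])
```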



In [184]:
from pyspark.ml.feature import VectorIndexer

data = spark.read.format("libsvm").load(
    "/usr/local/spark-2.4.3-bin-hadoop2.7/data/mllib/sample_libsvm_data.txt"
)

indexer = VectorIndexer(inputCol="features", outputCol="indexed", maxCategories=10)
indexerModel = indexer.fit(data)

categoricalFeatures = indexerModel.categoryMaps
print(
    f"Chose {len(categoricalFeatures)} categorical "
    f"features: {[str(k) for k in categoricalFeatures.keys()]}"
)

# Create new column "indexed" with categorical values transformed to indices
indexedData = indexerModel.transform(data)
indexedData.show()

Chose 351 categorical features: ['645', '69', '365', '138', '101', '479', '333', '249', '0', '555', '666', '88', '170', '115', '276', '308', '5', '449', '120', '247', '614', '677', '202', '10', '56', '533', '142', '500', '340', '670', '174', '42', '417', '24', '37', '25', '257', '389', '52', '14', '504', '110', '587', '619', '196', '559', '638', '20', '421', '46', '93', '284', '228', '448', '57', '78', '29', '475', '164', '591', '646', '253', '106', '121', '84', '480', '147', '280', '61', '221', '396', '89', '133', '116', '1', '507', '312', '74', '307', '452', '6', '248', '60', '117', '678', '529', '85', '201', '220', '366', '534', '102', '334', '28', '38', '561', '392', '70', '424', '192', '21', '137', '165', '33', '92', '229', '252', '197', '361', '65', '97', '665', '583', '285', '224', '650', '615', '9', '53', '169', '593', '141', '610', '420', '109', '256', '225', '339', '77', '193', '669', '476', '642', '637', '590', '679', '96', '393', '647', '173', '13', '41', '503', '134', '73'
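
The rule behind `maxCategories` can be sketched in plain Python: a feature counts as categorical when it has at most `maxCategories` distinct values, and its values are then replaced by 0-based category indices, while other features pass through unchanged. The helpers `fit_category_maps` and `transform_row` and the toy `rows` are hypothetical, and numbering the values in ascending order is an assumption that may differ from Spark's ordering.

```python
def fit_category_maps(rows, max_categories):
    """Return {feature_index: {value: category_index}} for categorical features."""
    num_features = len(rows[0])
    category_maps = {}
    for j in range(num_features):
        distinct = sorted({row[j] for row in rows})
        if len(distinct) <= max_categories:
            # Assumption: categories are numbered in ascending value order.
            category_maps[j] = {value: i for i, value in enumerate(distinct)}
    return category_maps


def transform_row(row, category_maps):
    """Replace categorical values by their index; leave continuous ones as-is."""
    return [
        float(category_maps[j][value]) if j in category_maps else value
        for j, value in enumerate(row)
    ]


# Toy data: feature 1 has four distinct values, so it stays continuous.
rows = [
    [0.0, 10.5, 2.0],
    [1.0, 11.7, 5.0],
    [0.0, 12.1, 2.0],
    [1.0, 13.9, 9.0],
]
maps = fit_category_maps(rows, max_categories=3)
print(maps)
print([transform_row(r, maps) for r in rows])
```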

In [194]:
from pyspark.ml.feature import Bucketizer

splits = [-float("inf"), -0.5, 0.0, 0.5, float("inf")]

data = [
    (-999.9,),
    (-0.5,),
    (-0.3,),
    (-0.4,),
    (-0.2,),
    (0.0,),
    (0.1,),
    (0.2,),
    (0.3,),
    (999.9,),
]
dataFrame = spark.createDataFrame(data, ["features"])

bucketizer = Bucketizer(
    splits=splits, inputCol="features", outputCol="bucketedFeatures"
)

# Transform original data into its bucket index.
bucketedData = bucketizer.transform(dataFrame)

print(f"Bucketizer output with {len(bucketizer.getSplits()) - 1} buckets")
bucketedData.show()

Bucketizer output with 4 buckets
+--------+----------------+
|features|bucketedFeatures|
+--------+----------------+
|  -999.9|             0.0|
|    -0.5|             1.0|
|    -0.3|             1.0|
|    -0.4|             1.0|
|    -0.2|             1.0|
|     0.0|             2.0|
|     0.1|             2.0|
|     0.2|             2.0|
|     0.3|             2.0|
|   999.9|             3.0|
+--------+----------------+
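
The table shows that each bucket is a half-open interval: -0.5 falls in bucket 1 and 0.0 in bucket 2, so a value equal to a split belongs to the bucket that starts there. A minimal sketch of that lookup using the standard library's `bisect` follows. The helper `bucket_index` is hypothetical, and rejecting out-of-range values with a `ValueError` is a choice made for this sketch.

```python
from bisect import bisect_right


def bucket_index(value, splits):
    """Return the bucket for value, where bucket i covers [splits[i], splits[i + 1]).

    The last bucket also includes its upper bound. Values outside the range
    covered by splits raise an error in this sketch.
    """
    if value < splits[0] or value > splits[-1]:
        raise ValueError(f"{value} is outside the range covered by splits")
    return float(min(bisect_right(splits, value) - 1, len(splits) - 2))


splits = [-float("inf"), -0.5, 0.0, 0.5, float("inf")]
values = [-999.9, -0.5, -0.3, 0.0, 0.3, 999.9]
print([bucket_index(v, splits) for v in values])
```

Using infinite outer splits, as in the cell above, guarantees that every finite value falls into some bucket.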

