# Parameters
MLlib `Estimators` and `Transformers` use a uniform API for specifying parameters.

A `Param` is a named parameter with self-contained documentation. A `ParamMap` is a set of (parameter, value) pairs.

There are two main ways to pass parameters to an algorithm:

- Set parameters for an instance. For example, if `lr` is an instance of `LogisticRegression`, one could call `lr.setMaxIter(10)` to make `lr.fit()` use at most 10 iterations. This API resembles the one used in the `spark.mllib` package.

- Pass a `ParamMap` to `.fit()` or `.transform()`. Any parameters in the `ParamMap` will override parameters previously specified via setter methods.
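
The precedence between the two approaches can be pictured as a layered dictionary merge. The sketch below is a pure-Python stand-in, not MLlib code: the `resolve` function and the three dictionaries are hypothetical names chosen for illustration, and the values are only examples.

```python
# Pure-Python stand-in for the lookup order; this is NOT MLlib code.
# All names and values below are illustrative assumptions.
defaults = {"maxIter": 100, "regParam": 0.0}      # values an algorithm starts with
set_by_user = {"maxIter": 10}                     # e.g. after lr.setMaxIter(10)
passed_to_fit = {"maxIter": 30, "regParam": 0.1}  # e.g. a ParamMap given to fit()


def resolve(defaults, set_by_user, extra=None):
    """Merge the three layers; later layers win on conflicts."""
    resolved = dict(defaults)          # start from a copy of the defaults
    resolved.update(set_by_user)       # setter values override defaults
    resolved.update(extra or {})       # a ParamMap passed at call time wins last
    return resolved


without_map = resolve(defaults, set_by_user)
with_map = resolve(defaults, set_by_user, passed_to_fit)
print(without_map)  # {'maxIter': 10, 'regParam': 0.0}
print(with_map)     # {'maxIter': 30, 'regParam': 0.1}
```

Note that the merge works on a copy, so the values set on the instance are left unchanged after the call.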

Parameters belong to specific instances of `Estimators` and `Transformers`. For example, if we have two `LogisticRegression` instances `lr1` and `lr2`, then we can build a `ParamMap` with both `maxIter` parameters specified: `ParamMap({lr1.maxIter: 10, lr2.maxIter: 20})`. This is useful when two algorithms in a `Pipeline` both have a `maxIter` parameter.
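
To see why two `maxIter` entries can coexist in one map, it may help to think of each key as a pair of owner and name. The sketch below uses a hypothetical `FakeParam` dataclass and a hypothetical `params_for` helper as stand-ins; neither is part of MLlib, and the owner IDs `"lr1"` and `"lr2"` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeParam:
    """Hypothetical stand-in for an MLlib Param: identified by owner and name."""

    parent: str  # unique ID of the owning instance (assumed for this sketch)
    name: str


lr1_max_iter = FakeParam(parent="lr1", name="maxIter")
lr2_max_iter = FakeParam(parent="lr2", name="maxIter")

# Same parameter name, different owners: two distinct keys in one map.
param_map = {lr1_max_iter: 10, lr2_max_iter: 20}


def params_for(owner, param_map):
    """Keep only the entries owned by the given instance."""
    return {p.name: v for p, v in param_map.items() if p.parent == owner}


lr1_params = params_for("lr1", param_map)
lr2_params = params_for("lr2", param_map)
print(lr1_params)  # {'maxIter': 10}
print(lr2_params)  # {'maxIter': 20}
```

Because the owner is part of the key, each stage can pick out its own values from a shared map without name clashes.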


### Example
This example was adapted from Spark's MLlib: Main Guide. Link to original:  
https://spark.apache.org/docs/2.4.3/ml-pipeline.html#example-estimator-transformer-and-param

In [None]:
from IPython.display import display, HTML
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()

# Prepare training data from a list of (label, features) tuples.
training = spark.createDataFrame(
    [
        (1.0, Vectors.dense([0.0, 1.1, 0.1])),
        (0.0, Vectors.dense([2.0, 1.0, -1.0])),
        (0.0, Vectors.dense([2.0, 1.3, 1.0])),
        (1.0, Vectors.dense([0.0, 1.2, -0.5])),
    ],
    ["label", "features"],
)

# Prepare test data
test = spark.createDataFrame(
    [
        (1.0, Vectors.dense([-1.0, 1.5, 1.3])),
        (0.0, Vectors.dense([3.0, 2.0, -0.1])),
        (1.0, Vectors.dense([0.0, 2.2, -1.5])),
    ],
    ["label", "features"],
)


In [182]:
# Create a LogisticRegression instance. This instance is an Estimator.
# maxIter and regParam are parameters
lr = LogisticRegression(maxIter=10, regParam=0.01)

# Print out the parameters, documentation, and any default values.
print(f"LogisticRegression parameters:\n{lr.explainParams()}")

LogisticRegression parameters:
aggregationDepth: suggested depth for treeAggregate (>= 2). (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0)
family: The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial (default: auto)
featuresCol: features column name. (default: features)
fitIntercept: whether to fit an intercept term. (default: True)
labelCol: label column name. (default: label)
lowerBoundsOnCoefficients: The lower bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. (undefined)
lowerBoundsOnIntercepts: The lower bounds on intercepts if fitting under bound constrained optimization. The bou

In [183]:
# I developed two simple helper functions to display
# the content of Params and ParamMaps in a more readable way


def print_explainParams(cls, font_size=1):
    """Helper function for pretty printing MLlib parameters
    and similar output in Jupyter / IPython.

    Usage example:
    > print_explainParams(LogisticRegression())

    Parameters
    ----------
    cls
        Estimator or Transformer instance (must provide the explainParams method)
        
    font_size : int
        control displayed font size (default = 1)
    """
    title = f"<h2>Parameters for: {str(cls)}</h2>"
    params = str(cls.explainParams()).split("\n")
    html_body = "\n".join([f'<h4>{p.replace(":", "</h4> <p>", 1)}</p>' for p in params])
    display(HTML(f"<font size='{font_size}'>{title} {html_body}</font>"))


def print_explainParamMap(cls, display_docs=True, font_size=1):
    """Helper function for pretty printing an MLlib parameter map
    and similar output in Jupyter / IPython.

    Usage example:
    > model = LogisticRegression().fit(training)
    > print_explainParamMap(model)
    > print_explainParamMap(model, False)

    Parameters
    ----------
    cls
        Estimator, Transformer, or Model instance (must provide the extractParamMap method)
        
    font_size : int
        control displayed font size (default = 1)
        
    display_docs : bool
        toggles displaying the docs or not
    """
    # Use the object passed in as cls rather than a global variable.
    title = f'<font size="{font_size}"><h3>Parameter Map </h3>{cls}</font>'
    param_map: dict = cls.extractParamMap()
    html = []
    if display_docs:
        html.append(title)
    for p in param_map.items():
        param = p[0]
        value = p[1]
        if display_docs:
            html.append(
                f"""
            <font size="{font_size}"><h4>{param.name}</h4></font>
            <p>
                <font size="{font_size - 1}">doc: <i>{param.doc}</i><br/>
                value: </font><font size="{font_size + 1}">{value}</font>
            </p>
            """
            )
        else:
            html.append(
                f'<li><font size="{font_size}"><b>{param.name}:</b> {value}</font></li>'
            )

    display(HTML(f'{"".join(html)}'))


In [184]:
print_explainParams(lr)

In [185]:
# Learn a LogisticRegression model. This uses the parameters stored in lr.
model1 = lr.fit(training)

# Since model1 is a Model (i.e., a transformer produced by an Estimator),
# we can view the parameters it used during fit().
# This prints the parameter (name: value) pairs, where names are unique IDs for this
# LogisticRegression instance.
print("Model 1 was fit using parameters: ")
print(model1.extractParamMap())


Model 1 was fit using parameters: 
{Param(parent='LogisticRegression_49aa5a3b7809', name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2)'): 2, Param(parent='LogisticRegression_49aa5a3b7809', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty'): 0.0, Param(parent='LogisticRegression_49aa5a3b7809', name='family', doc='The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial.'): 'auto', Param(parent='LogisticRegression_49aa5a3b7809', name='featuresCol', doc='features column name'): 'features', Param(parent='LogisticRegression_49aa5a3b7809', name='fitIntercept', doc='whether to fit an intercept term'): True, Param(parent='LogisticRegression_49aa5a3b7809', name='labelCol', doc='label column name'): 'label', Param(parent='LogisticRegression_49aa5a3b7809', name='maxIter', doc='maximum nu

In [186]:
print_explainParamMap(model1)

In [188]:
# We may alternatively specify parameters using a Python dictionary as a paramMap
paramMap = {lr.maxIter: 20}
paramMap[lr.maxIter] = 30  # Specify 1 Param, overwriting the original maxIter.
paramMap.update({lr.regParam: 0.1, lr.threshold: 0.55})  # Specify multiple Params.

# You can combine paramMaps, which are Python dictionaries.
paramMap2 = {lr.probabilityCol: "myProbability"}  # Change output column name
paramMapCombined = paramMap.copy()
paramMapCombined.update(paramMap2)

# Now learn a new model using the paramMapCombined parameters.
# Values in paramMapCombined override the same parameters set earlier via lr.set* methods.
model2 = lr.fit(training, paramMapCombined)

print("Model 2 fit used these parameters: ")
print_explainParamMap(model2, False)

Model 2 fit used these parameters: 


In [189]:
# Make predictions on test data using the Transformer.transform() method.
# LogisticRegressionModel.transform uses only the 'features' column.
# Note that model2.transform() outputs a "myProbability" column instead of the usual
# 'probability' column since we renamed the lr.probabilityCol parameter previously.
prediction = model2.transform(test)
result = prediction.select("features", "label", "myProbability", "prediction").collect()

for row in result:
    print(
        "features=%s, label=%s -> prob=%s, prediction=%s"
        % (row.features, row.label, row.myProbability, row.prediction)
    )

features=[-1.0,1.5,1.3], label=1.0 -> prob=[0.057073041710340174,0.9429269582896599], prediction=1.0
features=[3.0,2.0,-0.1], label=0.0 -> prob=[0.9238522311704104,0.07614776882958973], prediction=0.0
features=[0.0,2.2,-1.5], label=1.0 -> prob=[0.10972776114779419,0.8902722388522057], prediction=1.0
