# Machine Learning with Optimus

Machine learning is one of the last steps, and often the end goal, of most data science workflows.

Apache Spark ships a library called MLlib that implements many great machine learning algorithms. With the newer ML library we can take advantage of the DataFrame API and its optimizations to build machine learning pipelines easily.

Even though this task is not extremely hard, it is not easy either. Most machine learning models on Spark are not straightforward to use, and they need a lot of feature engineering to work. That’s why we created the feature engineering section inside the Transformer.

To import the Machine Learning Library you just need to import Optimus:

In [None]:
# Importing Optimus
import optimus as op

With Optimus you can combine this simple feature engineering with our machine learning library.

Let’s take a look at what Optimus can do for you:

## ml.logistic_regression_text(df, input_col)

This method runs a logistic regression on an input DataFrame, using a text column as the feature.

Let’s create a sample dataframe to see how it works.

In [None]:
# Import Row from pyspark
from pyspark.sql import Row
# Importing Optimus
import optimus as op

df = op.sc. \
    parallelize([Row(sentence='this is a test', label=0.),
                 Row(sentence='this is another test', label=1.)]). \
    toDF()

df.show()

In [None]:
df_predict, ml_model = op.ml.logistic_regression_text(df, "sentence")

This call returns two things: the DataFrame with the predictions, including the intermediate columns produced by each step of the pipeline, and a Spark machine learning model whose third step is the logistic regression.
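
To make the shape of such a pipeline concrete, here is a minimal pure-Python sketch of a three-stage text classifier: tokenize, hash term frequencies, then fit a logistic regression. It is not Optimus or Spark code. The text above only states that the third step is the logistic regression, so the first two stages, the `crc32` hash, and every function name and hyperparameter below are assumptions chosen for illustration.

```python
import math
import zlib


def tokenize(sentence):
    """Stage 1 (assumed): lowercase the sentence and split it on whitespace."""
    return sentence.lower().split()


def hashing_tf(tokens, num_features=1024):
    """Stage 2 (assumed): count tokens per hash bucket (the hashing trick)."""
    counts = {}
    for token in tokens:
        # crc32 is deterministic across runs, unlike Python's built-in hash()
        bucket = zlib.crc32(token.encode("utf-8")) % num_features
        counts[bucket] = counts.get(bucket, 0) + 1
    return counts  # sparse vector: {bucket index: term frequency}


def predict_proba(weights, bias, features):
    """Logistic model: sigmoid of the weighted sum of the sparse features."""
    z = bias + sum(weights.get(i, 0.0) * v for i, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_regression(rows, labels, lr=0.1, epochs=500):
    """Stage 3: batch gradient descent on the log-loss."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        grad_w, grad_b = {}, 0.0
        for features, label in zip(rows, labels):
            error = predict_proba(weights, bias, features) - label
            grad_b += error
            for i, v in features.items():
                grad_w[i] = grad_w.get(i, 0.0) + error * v
        bias -= lr * grad_b
        for i, g in grad_w.items():
            weights[i] = weights.get(i, 0.0) - lr * g
    return weights, bias


# Same two sentences and labels as the sample DataFrame
sentences = ["this is a test", "this is another test"]
labels = [0.0, 1.0]

rows = [hashing_tf(tokenize(s)) for s in sentences]
weights, bias = fit_logistic_regression(rows, labels)
probabilities = [predict_proba(weights, bias, r) for r in rows]
print(probabilities)
```

In a Spark pipeline each stage appends its output as a new column, which is why `df_predict` carries more columns than the input DataFrame.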

The columns of df_predict are:

In [None]:
df_predict.columns

The names are long because they are the UIDs of each step in the pipeline. Let’s compare the predictions with the actual labels:

In [None]:
transformer = op.DataFrameTransformer(df_predict)
transformer.select_idx([0,6]).show()

So we just did ML with a single line in Optimus. The model is also exposed in the ml_model variable so you can save it and evaluate it.
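
As a sketch of the saving step, assuming `ml_model` is a fitted Spark `PipelineModel` (which implements Spark's `MLWritable` interface), the helpers below persist the model and pull out its final stage. The helper names and the path are hypothetical; `write()`, `overwrite()`, `save()` and `stages` are standard Spark ML members.

```python
def save_spark_model(model, path):
    """Persist a fitted Spark ML model to `path`, replacing any previous copy.

    Assumes `model` implements Spark's MLWritable interface, as PipelineModel does.
    """
    model.write().overwrite().save(path)
    return path


def final_stage(pipeline_model):
    """Return the last stage of a fitted pipeline (here, the fitted classifier)."""
    return pipeline_model.stages[-1]


# Intended usage inside a Spark session (not executed here; the path is hypothetical):
# save_spark_model(ml_model, "/tmp/optimus_lr_text_model")
# lr_model = final_stage(ml_model)
```

Keeping the whole pipeline model, rather than only the classifier, means the same preprocessing steps are reapplied when the model is loaded and used on new text.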

# Tree models with Optimus

You can build Decision Trees, Random Forest models and also Gradient Boosted Trees with just one line of code in Optimus. Let’s download some sample data for analysis.

We got this dataset from Kaggle. The features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. The linear program used to obtain the separating plane in the 3-dimensional space is described in: [K. P. Bennett and O. L. Mangasarian: “Robust Linear Programming Discrimination of Two Linearly Inseparable Sets”, Optimization Methods and Software 1, 1992, 23-34].

Let’s download it with Optimus and save it into a DF:

In [None]:
# Creating the Optimus utilities helper
tools = op.Utilities()

# Downloading and creating Spark DF
df = tools.read_url("https://raw.githubusercontent.com/ironmussa/Optimus/master/tests/data_cancer.csv")

We’ll choose some columns to run the Machine Learning models:

In [None]:
columns = ['diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean',
           'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean','fractal_dimension_mean']

## ml.decision_tree(df, columns, input_col)

In [None]:
df_predict, dt_model = op.ml.decision_tree(df, columns, "diagnosis")

In [None]:
df_predict.show()

In [None]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Evaluate the predictions against the 'label' column
evaluator = BinaryClassificationEvaluator(labelCol='label')

# Area under the ROC curve
print(evaluator.evaluate(df_predict, {evaluator.metricName: "areaUnderROC"}))
# Area under the precision-recall curve
print(evaluator.evaluate(df_predict, {evaluator.metricName: "areaUnderPR"}))
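
To give some intuition for the first metric, here is a small pure-Python sketch of what the area under the ROC curve measures: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The function name and the toy labels and scores are made up for illustration, and this is not a description of how Spark computes the value internally.

```python
def area_under_roc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win. This pairwise form equals the area under the ROC curve.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data (hypothetical): two negatives and two positives
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = area_under_roc(labels, scores)
print(auc)  # 0.75: three of the four positive/negative pairs are ranked correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect one, which is a useful yardstick when reading the numbers printed by the evaluator.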