# Logistic regression using Spark ML

In [1]:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, IDF, NGram, RegexTokenizer, StopWordsRemover
from pyspark.ml.pipeline import Pipeline
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import functions as F, types as T, Row, SparkSession

Initialise a new `SparkSession`.

In [2]:
spark = SparkSession.builder\
        .appName("Spark ML")\
        .getOrCreate()

Load the [SMS Spam Collection](http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection).

In [3]:
schema = T.StructType([
    T.StructField("class", T.StringType(), nullable=False),
    T.StructField("text", T.StringType(), nullable=False),
])

spam = spark.read.csv("datasets/sms_spam.tsv", sep="\t", schema=schema)

In [4]:
spam.count()

5574

In [5]:
spam.show(5)

+-----+--------------------+
|class|                text|
+-----+--------------------+
|  ham|Go until jurong p...|
|  ham|Ok lar... Joking ...|
| spam|Free entry in 2 a...|
|  ham|U dun say so earl...|
|  ham|Nah I don't think...|
+-----+--------------------+
only showing top 5 rows



In [6]:
spam.groupBy(F.col("class")).count().show()

+-----+-----+
|class|count|
+-----+-----+
|  ham| 4827|
| spam|  747|
+-----+-----+



In [7]:
spam.select(F.col("text")).filter(F.col("class") == "spam").take(3)

[Row(text="Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"),
 Row(text="FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv"),
 Row(text='WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.')]

## Featurisation and modelling

Convert `class` to a binary `label` that can be used for modelling.

In [8]:
spam = spam.withColumn("label", (F.col("class") == "spam").cast(T.IntegerType()))\
           .drop(F.col("class"))

In [9]:
spam.show(5)

+--------------------+-----+
|                text|label|
+--------------------+-----+
|Go until jurong p...|    0|
|Ok lar... Joking ...|    0|
|Free entry in 2 a...|    1|
|U dun say so earl...|    0|
|Nah I don't think...|    0|
+--------------------+-----+
only showing top 5 rows



### Tokenisation

[Tokenisation](https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization) is the process of breaking text into individual terms (usually words).

In this example we use a simple [`RegexTokenizer`](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.RegexTokenizer) that converts the input string to lowercase and extracts words composed of one or more word characters (alphanumeric and underscore) using the [regular expression](https://en.wikipedia.org/wiki/Regular_expression) `\w+` (see also [RegExr](https://regexr.com/)).

In [10]:
tokenizer = RegexTokenizer(inputCol="text", outputCol="words", gaps=False, pattern="\\w+")
spamTransformed = tokenizer.transform(spam)

In [11]:
spamTransformed.take(1)

[Row(text='Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...', label=0, words=['go', 'until', 'jurong', 'point', 'crazy', 'available', 'only', 'in', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', 'cine', 'there', 'got', 'amore', 'wat'])]
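
To see what the tokeniser is doing, the same lowercasing and `\w+` matching can be roughly reproduced in plain Python with the standard `re` module. This is only a sketch: the `tokenize` helper is hypothetical, and `re.ASCII` is an assumption made here to mimic the ASCII-only `\w` of the Java regular expressions that Spark relies on.

```python
import re

def tokenize(text, pattern=r"\w+"):
    """Lowercase the text and return every non-overlapping match of the pattern."""
    # re.ASCII restricts \w to [a-zA-Z0-9_], as Java regexes do by default.
    return re.findall(pattern, text.lower(), flags=re.ASCII)

tokens = tokenize("Go until jurong point, crazy.. Available only in bugis n great world la e buffet...")
print(tokens)
```

Note that punctuation is dropped entirely, so a string such as `T&C's` is split into the three tokens `t`, `c` and `s`.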

### $n$-grams

$n$-grams are contiguous sequences of $n$ tokens (typically words) taken from a text.

In this example we create 2-grams using the [`NGram`](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.NGram) featuriser.

In [12]:
ngram = NGram(inputCol="words", outputCol="ngrams", n=2)
spamTransformed = ngram.transform(spamTransformed)

In [13]:
spamTransformed.take(1)

[Row(text='Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...', label=0, words=['go', 'until', 'jurong', 'point', 'crazy', 'available', 'only', 'in', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', 'cine', 'there', 'got', 'amore', 'wat'], ngrams=['go until', 'until jurong', 'jurong point', 'point crazy', 'crazy available', 'available only', 'only in', 'in bugis', 'bugis n', 'n great', 'great world', 'world la', 'la e', 'e buffet', 'buffet cine', 'cine there', 'there got', 'got amore', 'amore wat'])]
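
The sliding-window idea behind the output above can be sketched in a few lines of plain Python. The `ngrams` helper is hypothetical and is not Spark's implementation; it only assumes that consecutive tokens are joined with a single space, as the output suggests.

```python
def ngrams(tokens, n=2):
    """Return all contiguous runs of n tokens, each joined by a single space."""
    # A list of k tokens yields k - n + 1 n-grams (none if k < n).
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = ["go", "until", "jurong", "point", "crazy"]
bigrams = ngrams(words, n=2)
print(bigrams)  # ['go until', 'until jurong', 'jurong point', 'point crazy']
```

With `n=1` the function returns the tokens unchanged, which is why a grid over `n` can compare single words against 2-grams.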

### tf–idf

[Term frequency – inverse document frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) (tf–idf) is widely used to capture the importance of words and $n$-grams to documents in a corpus.

In this example we build tf–idf importances using:

- [`HashingTF`](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.HashingTF), which maps features to indices using the [hashing trick](https://en.wikipedia.org/wiki/Feature_hashing) and computes their frequencies;
- [`IDF`](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.IDF), which rescales these frequencies and downweights features that appear frequently in the corpus.
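
The two steps can be sketched in plain Python to make the arithmetic concrete. This is a rough illustration under stated assumptions, not Spark's code: `zlib.crc32` stands in for the MurmurHash3 function that Spark uses (so bucket indices will differ), the helper names and the toy documents are hypothetical, and the inverse document frequency follows the smoothed form $\log\frac{m + 1}{\mathrm{df} + 1}$ given in the Spark MLlib documentation, where $m$ is the number of documents and $\mathrm{df}$ the number of documents containing the feature.

```python
import math
import zlib
from collections import Counter

def hashing_tf(terms, num_features):
    """Map each term to a bucket index and count occurrences per bucket."""
    counts = Counter()
    for term in terms:
        # Stand-in hash; distinct terms may collide in the same bucket.
        index = zlib.crc32(term.encode("utf-8")) % num_features
        counts[index] += 1.0
    return dict(counts)

def idf_weights(docs_tf):
    """Smoothed inverse document frequency per bucket: log((m + 1) / (df + 1))."""
    m = len(docs_tf)
    df = Counter()
    for tf in docs_tf:
        df.update(tf.keys())  # each document counts at most once per bucket
    return {i: math.log((m + 1) / (d + 1)) for i, d in df.items()}

def tf_idf(tf, idf):
    """Rescale raw bucket counts by their idf weight."""
    return {i: f * idf[i] for i, f in tf.items()}

# Hypothetical toy corpus of 2-grams.
docs = [["free entry", "entry in"], ["free entry", "to win"], ["ok lar", "lar joking"]]
docs_tf = [hashing_tf(d, num_features=512) for d in docs]
idf = idf_weights(docs_tf)
features = [tf_idf(tf, idf) for tf in docs_tf]
print(features[0])
```

A feature that occurs in every document gets weight $\log 1 = 0$, which is the sense in which frequent features are downweighted.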

In [14]:
hashingTF = HashingTF(inputCol="ngrams", outputCol="rawFeatures", numFeatures=2<<8)  # 2<<8 = 512 hash buckets
spamTransformed = hashingTF.transform(spamTransformed)

In [15]:
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(spamTransformed)
spamTransformed = idfModel.transform(spamTransformed)

In [16]:
spamTransformed.take(1)

[Row(text='Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...', label=0, words=['go', 'until', 'jurong', 'point', 'crazy', 'available', 'only', 'in', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', 'cine', 'there', 'got', 'amore', 'wat'], ngrams=['go until', 'until jurong', 'jurong point', 'point crazy', 'crazy available', 'available only', 'only in', 'in bugis', 'bugis n', 'n great', 'great world', 'world la', 'la e', 'e buffet', 'buffet cine', 'cine there', 'there got', 'got amore', 'amore wat'], rawFeatures=SparseVector(512, {49: 1.0, 90: 1.0, 124: 1.0, 133: 1.0, 148: 1.0, 168: 1.0, 200: 1.0, 213: 1.0, 220: 1.0, 272: 1.0, 303: 1.0, 319: 1.0, 335: 1.0, 337: 1.0, 353: 1.0, 397: 1.0, 418: 1.0, 427: 1.0, 446: 1.0}), features=SparseVector(512, {49: 3.9165, 90: 3.7977, 124: 3.4556, 133: 3.8725, 148: 3.5762, 168: 3.7432, 200: 3.2695, 213: 3.6022, 220: 3.5021, 272: 3.6154, 303: 3.303, 319: 3.7134, 335: 3.5201, 337: 3.9913, 35

### Modelling

We're finally ready to fit a [`LogisticRegression`](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression) model.

In [17]:
lr = LogisticRegression(maxIter=10, family="binomial")
lrModel = lr.fit(spamTransformed)

We can extract a number of classification metrics from the [`BinaryLogisticRegressionTrainingSummary`](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.classification.BinaryLogisticRegressionTrainingSummary) contained in the `summary` attribute of the fitted model. These metrics are computed on the training data, so they are likely to overstate performance on unseen messages.

In [18]:
lrModel.summary.accuracy

0.9540724793684966

In [19]:
lrModel.summary.areaUnderROC

0.9687478593331962
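
One way to read the area under the ROC curve is as the probability that a randomly chosen spam message receives a higher score than a randomly chosen ham message. The sketch below computes it that way on hypothetical labels and scores; it is an illustration of the metric, not the procedure Spark uses internally, and the `auc` helper is an assumed name.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical example: one of the four (positive, negative) pairs is misordered.
example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(example_auc)  # 0.75
```

Because it depends only on the ranking of scores, this metric does not change with the classification threshold, unlike accuracy.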

## Cross-validation

We encapsulate all the transformation and modelling steps into a single `Pipeline` that can be used as the estimator in [`CrossValidator`](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.tuning.CrossValidator).

In [20]:
tokenizer = RegexTokenizer(inputCol="text", outputCol="words", gaps=False, pattern="\\w+")
ngram = NGram(inputCol=tokenizer.getOutputCol(), outputCol="ngrams")
hashingTF = HashingTF(inputCol=ngram.getOutputCol(), outputCol="rawFeatures")
idf = IDF(inputCol=hashingTF.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, family="binomial")
pipeline = Pipeline(stages=[tokenizer, ngram, hashingTF, idf, lr])

Next, we define the grid of hyperparameters.

In [21]:
paramGrid = ParamGridBuilder()\
    .addGrid(ngram.n, [1, 2])\
    .addGrid(hashingTF.binary, [False, True])\
    .addGrid(hashingTF.numFeatures, [2<<x for x in range(8, 12)])\
    .addGrid(lr.regParam, [10.**x for x in range(-4, 5)])\
    .addGrid(lr.elasticNetParam, [0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1.])\
    .build()

In [22]:
len(paramGrid)

1008
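
The grid size is simply the product of the number of values tried for each hyperparameter. A plain-Python sketch using `itertools.product` reproduces the count; the dictionary keys are only descriptive labels, and `num_folds = 3` is taken from the cross-validation set-up used in this notebook.

```python
from itertools import product

grid_values = {
    "ngram.n": [1, 2],
    "hashingTF.binary": [False, True],
    "hashingTF.numFeatures": [512, 1024, 2048, 4096],
    "lr.regParam": [10.0 ** x for x in range(-4, 5)],  # 1e-4 up to 1e4
    "lr.elasticNetParam": [0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1.0],
}

# One dictionary per combination of hyperparameter values.
combinations = [dict(zip(grid_values, values)) for values in product(*grid_values.values())]
print(len(combinations))  # 1008 = 2 * 2 * 4 * 9 * 7

num_folds = 3  # each combination is fitted once per fold
print(len(combinations) * num_folds)  # 3024
```

A full grid search therefore grows multiplicatively with every added hyperparameter, which is worth keeping in mind before widening any of the ranges.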

Finally, we run a 3-fold cross-validation procedure and score models by the area under the ROC curve.

In [23]:
cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=paramGrid,
                    evaluator=BinaryClassificationEvaluator(metricName="areaUnderROC"),
                    numFolds=3,
                    seed=42)

In [24]:
cvModel = cv.fit(spam)

We can now retrieve the 'optimal' values for the hyperparameters.

In [25]:
cvModel.bestModel.stages[1].getN()

1

In [26]:
cvModel.bestModel.stages[2].getBinary()

True

In [27]:
cvModel.bestModel.stages[2].getNumFeatures()

1024

In [28]:
lrModel = cvModel.bestModel.stages[4]
lrParams = {k.name: v for k, v in lrModel.extractParamMap().items()}

In [29]:
lrParams["regParam"]

0.01

In [30]:
lrParams["elasticNetParam"]

0.1
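
To interpret these two values, it may help to recall the elastic-net penalty described in the Spark MLlib documentation, $\lambda \left( \alpha \lVert w \rVert_1 + \frac{1 - \alpha}{2} \lVert w \rVert_2^2 \right)$, with $\lambda$ = `regParam` and $\alpha$ = `elasticNetParam`. The sketch below evaluates it for a hypothetical coefficient vector; the function name and the weights are illustrative assumptions.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Penalty = reg_param * (alpha * L1 norm + (1 - alpha) / 2 * squared L2 norm)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (elastic_net_param * l1 + (1.0 - elastic_net_param) / 2.0 * l2_squared)

weights = [0.5, -2.0, 0.0]  # hypothetical coefficients, not taken from the fitted model
penalty = elastic_net_penalty(weights, reg_param=0.01, elastic_net_param=0.1)
print(penalty)
```

With `elasticNetParam` at 0.1, the penalty is dominated by the squared L2 term, so the selected model is closer to ridge than to lasso regularisation.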

### Classification metrics

The `summary` of the best model's logistic regression stage reports metrics computed on the full training data, not on held-out folds.

In [31]:
lrModel.summary.accuracy

0.9946178686759957

In [32]:
lrModel.summary.areaUnderROC

0.9995349119702344

### Confusion matrix

In [33]:
predictions = cvModel.transform(spam)

In [34]:
predictions.groupBy(F.col("label"), F.col("prediction"))\
           .count()\
           .show()

+-----+----------+-----+
|label|prediction|count|
+-----+----------+-----+
|    1|       0.0|   30|
|    0|       0.0| 4827|
|    1|       1.0|  717|
+-----+----------+-----+
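
The usual summary metrics follow directly from these counts. The sketch below copies the three rows of the table into a dictionary keyed by `(label, prediction)`; the missing `(0, 1)` row is treated as zero false positives, which is an assumption based on the row's absence.

```python
# (label, prediction) -> count, copied from the table above
counts = {(1, 0): 30, (0, 0): 4827, (1, 1): 717}

tp = counts.get((1, 1), 0)  # spam predicted as spam
tn = counts.get((0, 0), 0)  # ham predicted as ham
fp = counts.get((0, 1), 0)  # ham predicted as spam (no such row, so 0)
fn = counts.get((1, 0), 0)  # spam predicted as ham

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(accuracy, precision, recall)
```

The accuracy agrees with the value reported by the model summary, while the recall of roughly 0.96 shows that all 30 errors are spam messages that slipped through as ham.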



### Prediction

In [35]:
df = spark.createDataFrame([
    Row(text="Hello! I was just texting to see if you'd decided to do anything tomorrow."),
    Row(text="URGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot!"),
])

In [36]:
cvModel.transform(df).toPandas()

| | text | words | ngrams | rawFeatures | features | rawPrediction | probability | prediction |
|---|---|---|---|---|---|---|---|---|
| 0 | Hello! I was just texting to see if you'd deci... | [hello, i, was, just, texting, to, see, if, yo... | [hello, i, was, just, texting, to, see, if, yo... | (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [6.112016377977277, -6.112016377977277] | [0.9977888222878144, 0.0022111777121855566] | 0.0 |
| 1 | URGENT! You have won a 1 week FREE membership ... | [urgent, you, have, won, a, 1, week, free, mem... | [urgent, you, have, won, a, 1, week, free, mem... | (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [-3.370463161167976, 3.370463161167976] | [0.0332314255979354, 0.9667685744020647] | 1.0 |
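
The `rawPrediction`, `probability` and `prediction` columns are linked by the logistic function: the second entry of `rawPrediction` is the model's margin for the spam class, and applying the sigmoid to it gives the spam probability. The sketch below checks this on the two rows above, assuming the default decision threshold of 0.5; the `sigmoid` helper is an illustrative name.

```python
import math

def sigmoid(margin):
    """Logistic function mapping a real-valued margin to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))

# rawPrediction values copied from the output above: [-margin, margin]
raw_predictions = [
    [6.112016377977277, -6.112016377977277],
    [-3.370463161167976, 3.370463161167976],
]

probabilities = []
for raw in raw_predictions:
    p_spam = sigmoid(raw[1])
    probabilities.append([1.0 - p_spam, p_spam])
    # Assumed threshold of 0.5: predict spam (1.0) when p_spam exceeds it.
    print(probabilities[-1], float(p_spam > 0.5))
```

Lowering the threshold would catch more spam at the cost of flagging more legitimate messages, which is one way to trade recall against precision.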
