# Simple ML PySpark Example

A very simple ML use case with PySpark.<br>
This case is much simpler than the ML already covered with Cambridge Spark, but it uses PySpark rather than scikit-learn. I'm not spending much time on it, because the more abstract concepts are more important (and more challenging) than the mechanics of one library versus another.

In [1]:
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session for this notebook
spark = SparkSession.builder.appName('Missing').getOrCreate()

In [2]:
# Read the CSV with a header row and let Spark infer the column types
training = spark.read.csv('Pokemon.csv', header=True, inferSchema=True)

In [3]:
training.show()

+---+--------------------+------+------+-----+---+------+-------+-------+-------+-----+----------+---------+
|  #|                Name|Type 1|Type 2|Total| HP|Attack|Defense|Sp. Atk|Sp. Def|Speed|Generation|Legendary|
+---+--------------------+------+------+-----+---+------+-------+-------+-------+-----+----------+---------+
|  1|           Bulbasaur| Grass|Poison|  318| 45|    49|     49|     65|     65|   45|         1|    false|
|  2|             Ivysaur| Grass|Poison|  405| 60|    62|     63|     80|     80|   60|         1|    false|
|  3|            Venusaur| Grass|Poison|  525| 80|    82|     83|    100|    100|   80|         1|    false|
|  3|VenusaurMega Venu...| Grass|Poison|  625| 80|   100|    123|    122|    120|   80|         1|    false|
|  4|          Charmander|  Fire|  null|  309| 39|    52|     43|     60|     50|   65|         1|    false|
|  5|          Charmeleon|  Fire|  null|  405| 58|    64|     58|     80|     65|   80|         1|    false|
|  6|           Cha

The goal is to predict HP from Attack, Defense and Speed.

In [12]:
training = training.select(['HP','Attack','Defense', 'Speed'])

Assemble the three predictor columns into a single vector column, because Spark ML estimators expect all features in one column:

In [13]:
from pyspark.ml.feature import VectorAssembler

# Combine the three predictor columns into one vector column
featureassembler = VectorAssembler(inputCols=['Attack', 'Defense', 'Speed'], outputCol="Independent Features")
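
As a rough mental model of what the assembler does to each row, here is a plain-Python sketch (not Spark code). The `assemble` helper is hypothetical and written only for this illustration; the two sample rows are copied from the table above. The real `VectorAssembler` works on DataFrame columns and produces Spark vector objects rather than Python lists.

```python
# Plain-Python illustration of vector assembly (not Spark code).
# `assemble` is a hypothetical helper written for this sketch.

def assemble(row, input_cols):
    """Return the values of input_cols from row as one list of floats."""
    return [float(row[col]) for col in input_cols]

# First two rows of the dataset, as shown in the table above.
rows = [
    {"HP": 45, "Attack": 49, "Defense": 49, "Speed": 45},
    {"HP": 60, "Attack": 62, "Defense": 63, "Speed": 60},
]

feature_cols = ["Attack", "Defense", "Speed"]
assembled = [
    {"HP": row["HP"], "Independent Features": assemble(row, feature_cols)}
    for row in rows
]

for row in assembled:
    print(row)
```

The order of `inputCols` fixes the order of the values inside each vector, which matters later when reading the model's coefficients.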

In [14]:
output = featureassembler.transform(training)

In [15]:
output.show()

+---+------+-------+-----+--------------------+
| HP|Attack|Defense|Speed|Independent Features|
+---+------+-------+-----+--------------------+
| 45|    49|     49|   45|    [49.0,49.0,45.0]|
| 60|    62|     63|   60|    [62.0,63.0,60.0]|
| 80|    82|     83|   80|    [82.0,83.0,80.0]|
| 80|   100|    123|   80|  [100.0,123.0,80.0]|
| 39|    52|     43|   65|    [52.0,43.0,65.0]|
| 58|    64|     58|   80|    [64.0,58.0,80.0]|
| 78|    84|     78|  100|   [84.0,78.0,100.0]|
| 78|   130|    111|  100| [130.0,111.0,100.0]|
| 78|   104|     78|  100|  [104.0,78.0,100.0]|
| 44|    48|     65|   43|    [48.0,65.0,43.0]|
| 59|    63|     80|   58|    [63.0,80.0,58.0]|
| 79|    83|    100|   78|   [83.0,100.0,78.0]|
| 79|   103|    120|   78|  [103.0,120.0,78.0]|
| 45|    30|     35|   45|    [30.0,35.0,45.0]|
| 50|    20|     55|   30|    [20.0,55.0,30.0]|
| 60|    45|     50|   70|    [45.0,50.0,70.0]|
| 40|    35|     30|   50|    [35.0,30.0,50.0]|
| 45|    25|     50|   35|    [25.0,50.0

In [18]:
finalised_data = output.select(['HP','Independent Features'])
finalised_data.show()

+---+--------------------+
| HP|Independent Features|
+---+--------------------+
| 45|    [49.0,49.0,45.0]|
| 60|    [62.0,63.0,60.0]|
| 80|    [82.0,83.0,80.0]|
| 80|  [100.0,123.0,80.0]|
| 39|    [52.0,43.0,65.0]|
| 58|    [64.0,58.0,80.0]|
| 78|   [84.0,78.0,100.0]|
| 78| [130.0,111.0,100.0]|
| 78|  [104.0,78.0,100.0]|
| 44|    [48.0,65.0,43.0]|
| 59|    [63.0,80.0,58.0]|
| 79|   [83.0,100.0,78.0]|
| 79|  [103.0,120.0,78.0]|
| 45|    [30.0,35.0,45.0]|
| 50|    [20.0,55.0,30.0]|
| 60|    [45.0,50.0,70.0]|
| 40|    [35.0,30.0,50.0]|
| 45|    [25.0,50.0,35.0]|
| 65|    [90.0,40.0,75.0]|
| 65|  [150.0,40.0,145.0]|
+---+--------------------+
only showing top 20 rows



In [21]:
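from pyspark.ml.regression import LinearRegression

# Train/test split: roughly 75% of rows for training, 25% for testing.
# No seed is set, so the split (and the numbers below) change between runs.
train_data, test_data = finalised_data.randomSplit([0.75, 0.25])
regressor = LinearRegression(featuresCol='Independent Features', labelCol='HP')
# fit() returns a fitted model, which replaces the estimator in `regressor`.
regressor = regressor.fit(train_data)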
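
`randomSplit` assigns rows to the splits at random according to the weights, so the resulting sizes are only approximately 75% and 25%. The plain-Python sketch below imitates that idea. The `random_split` helper, the seed value and the toy data are assumptions made for illustration; this is not Spark's actual implementation.

```python
import random

def random_split(rows, train_weight, seed):
    """Send each row to train with probability train_weight, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(1000))  # stand-in for DataFrame rows
train, test = random_split(rows, train_weight=0.75, seed=42)

# Every row lands in exactly one split, but the sizes are only near 750/250.
print(len(train), len(test))

# The same seed reproduces the same split.
print(random_split(rows, 0.75, seed=42) == (train, test))  # True
```

In PySpark itself, `randomSplit` accepts an optional `seed` argument for the same purpose, which makes the reported metrics easier to reproduce.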

In [22]:
# Evaluate on the held-out test set; the returned summary holds the predictions and metrics
pred_results = regressor.evaluate(test_data)
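
For intuition, each prediction from a linear regression model is the intercept plus a weighted sum of the features. The sketch below shows that calculation in plain Python. The intercept and coefficient values are made up for illustration and are not the values fitted in this notebook; a fitted PySpark model exposes the real ones as `regressor.coefficients` and `regressor.intercept`.

```python
# Hypothetical values for illustration only, NOT the fitted values.
intercept = 40.0
coefficients = [0.2, 0.1, 0.05]  # weights for Attack, Defense, Speed

def predict(features):
    """Linear regression prediction: intercept plus weighted sum of features."""
    return intercept + sum(w * x for w, x in zip(coefficients, features))

# Features for one row: Attack=49, Defense=49, Speed=45.
print(predict([49.0, 49.0, 45.0]))
```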

In [23]:
pred_results.predictions.show()



+---+--------------------+------------------+
| HP|Independent Features|        prediction|
+---+--------------------+------------------+
| 20|    [25.0,45.0,60.0]| 51.91578903943794|
| 30|   [45.0,135.0,30.0]|  63.7078194338166|
| 30|   [65.0,100.0,40.0]| 66.65940895965747|
| 30|   [105.0,90.0,50.0]|  77.1021395872071|
| 31|    [45.0,90.0,40.0]| 60.38362856685312|
| 35|    [20.0,65.0,20.0]| 50.88231996559459|
| 35|   [80.0,50.0,120.0]| 69.28249161479158|
| 37|    [25.0,41.0,25.0]| 50.46507344127277|
| 38|    [30.0,41.0,60.0]| 52.95807783044712|
| 38|    [30.0,85.0,30.0]|55.557462130650876|
| 38|    [35.0,40.0,35.0]|  53.4387106138629|
| 38|    [41.0,40.0,65.0]| 56.04404636917246|
| 39|    [52.0,43.0,65.0]| 59.29316546706413|
| 40|    [29.0,45.0,36.0]| 52.23635800007891|
| 40|   [35.0,30.0,105.0]|54.881708959507954|
| 40|    [40.0,35.0,70.0]| 55.52659476673578|
| 40|    [45.0,35.0,55.0]|56.410142857161574|
| 40|    [45.0,40.0,56.0]| 56.84745221943801|
| 40|    [45.0,40.0,65.0]| 57.1371

In [25]:
# Mean absolute error and mean squared error on the test set
pred_results.meanAbsoluteError, pred_results.meanSquaredError

(15.606658585360195, 456.8444295623709)
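
To make the two metrics concrete, the plain-Python sketch below recomputes them by hand on the first five rows of the predictions table above. Because it uses only five rows (those with the lowest HP), its results differ from the full test-set values; the helper names are illustrative. It also takes the square root of the reported MSE to get an RMSE in the same units as HP.

```python
import math

# (actual HP, predicted HP) for the first five rows of the predictions table.
pairs = [
    (20, 51.91578903943794),
    (30, 63.7078194338166),
    (30, 66.65940895965747),
    (30, 77.1021395872071),
    (31, 60.38362856685312),
]

def mean_absolute_error(pairs):
    """Average of |actual - predicted|."""
    return sum(abs(actual - pred) for actual, pred in pairs) / len(pairs)

def mean_squared_error(pairs):
    """Average of (actual - predicted) squared."""
    return sum((actual - pred) ** 2 for actual, pred in pairs) / len(pairs)

sample_mae = mean_absolute_error(pairs)
sample_mse = mean_squared_error(pairs)
print(sample_mae, sample_mse)

# RMSE for the whole test set, from the MSE reported above.
rmse = math.sqrt(456.8444295623709)
print(round(rmse, 1))  # 21.4
```

The RMSE of about 21.4 HP is noticeably above the MAE of about 15.6 HP, which suggests that a few large errors weigh heavily on the squared-error metric.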