<p style="font-family: Arial; font-size:3.75em;color:purple; font-style:bold"><br>
Spark Preprocessing</p><br>

Preprocessing helps enrich the data. For example, if we have weight and height features, we can combine them into a clearer health indicator: BMI.
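As a minimal sketch of the BMI idea: the housing data used in this notebook has no weight or height columns, so the column names `weight_kg` and `height_m` and the helper names `bmi` and `add_bmi` are hypothetical. It assumes weight in kilograms and height in metres.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2


def add_bmi(df, weight_col="weight_kg", height_col="height_m"):
    """Add a 'BMI' column to a Spark DataFrame (column names are hypothetical)."""
    # bmi() only uses / and **, which Spark Column objects also support,
    # so the same expression builds a column-wise computation.
    return df.withColumn("BMI", bmi(df[weight_col], df[height_col]))
```

The plain function can be checked with ordinary numbers, and `add_bmi(some_df)` would apply the same formula to every row of a DataFrame that has the two assumed columns.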

In [1]:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import asc, col  # column helpers used in the cells below

pd.set_option("display.max_columns", None)

sc = SparkSession.builder.appName('HelloOne').getOrCreate()
data_frame = sc.read.csv('house_regression_train.csv', header=True, inferSchema=True)

To introduce a new feature, use withColumn:

In [2]:
# Create a new column, SubClassRed, holding MSSubClass divided by 10
data_frame = data_frame.withColumn('SubClassRed', col('MSSubClass')/10)
data_frame.printSchema()


root
 |-- Id: integer (nullable = true)
 |-- MSSubClass: integer (nullable = true)
 |-- MSZoning: string (nullable = true)
 |-- LotFrontage: string (nullable = true)
 |-- LotArea: integer (nullable = true)
 |-- Street: string (nullable = true)
 |-- Alley: string (nullable = true)
 |-- LotShape: string (nullable = true)
 |-- LandContour: string (nullable = true)
 |-- Utilities: string (nullable = true)
 |-- LotConfig: string (nullable = true)
 |-- LandSlope: string (nullable = true)
 |-- Neighborhood: string (nullable = true)
 |-- Condition1: string (nullable = true)
 |-- Condition2: string (nullable = true)
 |-- BldgType: string (nullable = true)
 |-- HouseStyle: string (nullable = true)
 |-- OverallQual: integer (nullable = true)
 |-- OverallCond: integer (nullable = true)
 |-- YearBuilt: integer (nullable = true)
 |-- YearRemodAdd: integer (nullable = true)
 |-- RoofStyle: string (nullable = true)
 |-- RoofMatl: string (nullable = true)
 |-- Exterior1st: string (nullable = true)
 |--

In [3]:
data_frame.select('SubClassRed','MSSubClass').show(10)

+-----------+----------+
|SubClassRed|MSSubClass|
+-----------+----------+
|        6.0|        60|
|        2.0|        20|
|        6.0|        60|
|        7.0|        70|
|        6.0|        60|
|        5.0|        50|
|        2.0|        20|
|        6.0|        60|
|        5.0|        50|
|       19.0|       190|
+-----------+----------+
only showing top 10 rows



We can see that the new feature is the old one scaled down by a factor of ten.

As a side note, you can change the order in which the features appear by passing a list of column names, in the desired order, to the select function.
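A minimal sketch of that reordering, assuming you want a few chosen columns first and the rest left in their existing order. The helper names `move_to_front` and `reorder` are hypothetical, not part of PySpark.

```python
def move_to_front(columns, first):
    """Return the column names with `first` leading and the rest in their original order."""
    missing = [name for name in first if name not in columns]
    if missing:
        raise ValueError(f"unknown columns: {missing}")
    return list(first) + [name for name in columns if name not in first]


def reorder(df, first):
    """Select every column of a Spark DataFrame, placing `first` at the front."""
    return df.select(*move_to_front(df.columns, first))
```

For example, `reorder(data_frame, ['Id', 'SubClassRed', 'MSSubClass'])` would return the same data with those three columns leading.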

##### Removing potentially noisy entries that appear only once in large datasets

Rare categorical values can cause problems during cross-validation (for example, a validation fold may contain a category that never appeared in the training folds), so one option is to drop the rows that contain them.
Personally, I prefer relabelling them, especially when many of them appear only once or twice, as in the Condition2 feature below.

Should you need to remove them, proceed as follows:

In [7]:
data_frame.groupBy('Condition2').count().sort(asc('count')).show()

+----------+-----+
|Condition2|count|
+----------+-----+
|      PosA|    1|
|      RRAn|    1|
|      RRAe|    1|
|    Artery|    2|
|      RRNn|    2|
|      PosN|    2|
|     Feedr|    6|
|      Norm| 1445|
+----------+-----+



In [8]:
data_frame = data_frame.filter(data_frame.Condition2 != "PosA")
data_frame.groupBy('Condition2').count().sort(asc('count')).show()

+----------+-----+
|Condition2|count|
+----------+-----+
|      RRAn|    1|
|      RRAe|    1|
|    Artery|    2|
|      RRNn|    2|
|      PosN|    2|
|     Feedr|    6|
|      Norm| 1445|
+----------+-----+



From the code above, one can see that we dropped the rows with the categorical value "PosA".
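Filtering one value at a time gets tedious when several categories are rare. The sketch below derives the rare values from the counts and drops them in one filter. The helper names `values_at_most` and `drop_rare_rows` are hypothetical, and it assumes the column has few enough distinct values for the counts to be collected to the driver.

```python
def values_at_most(counts, max_count=1):
    """Pick the category values that occur at most `max_count` times.

    `counts` maps each category value to its number of rows.
    """
    return sorted(
        value for value, n in counts.items()
        if value is not None and n <= max_count
    )


def drop_rare_rows(df, column, max_count=1):
    """Drop rows of a Spark DataFrame whose `column` value is rare."""
    # The grouped result has one row per category, so collecting it is cheap.
    rows = df.groupBy(column).count().collect()
    rare = values_at_most({row[column]: row["count"] for row in rows}, max_count)
    if not rare:
        return df
    # Keep only the rows whose category is not in the rare list.
    return df.filter(~df[column].isin(rare))
```

With the counts from cell In [7], `drop_rare_rows(data_frame, 'Condition2')` would remove the PosA, RRAn and RRAe rows together.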
Personally, I prefer the replacement method below unless only a handful of rows are affected, because dropping data is usually not a good idea.

In [14]:
rare_values = ['RRAn', 'RRAe', 'Artery', 'PosN', 'RRNn', 'Feedr']
data_frame = data_frame.na.replace(rare_values, 'Rare', 'Condition2')
data_frame.groupBy('Condition2').count().sort(asc('count')).show()

+----------+-----+
|Condition2|count|
+----------+-----+
|      Rare|   14|
|      Norm| 1445|
+----------+-----+
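The relabelling above lists the rare categories by hand. A minimal sketch that derives them from the counts instead is shown here. The helper names `rare_categories` and `relabel_rare` and the `min_count` threshold are assumptions for illustration, so choose a threshold that suits your data.

```python
def rare_categories(counts, min_count):
    """Return the category values that occur fewer than `min_count` times."""
    return sorted(
        value for value, n in counts.items()
        if value is not None and n < min_count
    )


def relabel_rare(df, column, min_count, label="Rare"):
    """Replace every rare value in `column` of a Spark DataFrame with `label`."""
    # One row per category, so collecting the counts to the driver is cheap.
    rows = df.groupBy(column).count().collect()
    rare = rare_categories({row[column]: row["count"] for row in rows}, min_count)
    if not rare:
        return df
    # A scalar replacement value is applied to every item in the list.
    return df.replace(rare, label, subset=[column])
```

With the Condition2 counts shown earlier, `relabel_rare(data_frame, 'Condition2', min_count=10)` would fold the same six categories into "Rare" without naming them explicitly.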

