# Lab 3 - Encoding Categorical Features in Spark

In [None]:
import findspark
findspark.find()
findspark.init() 

In [None]:
from pyspark.sql import SparkSession
sc1 = SparkSession.builder.appName("Lab-03_Encoding_categorical_features").getOrCreate()
sc1

In [None]:
from pyspark.ml.feature import StringIndexer

In [None]:
df = sc1.createDataFrame(
    [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],
    ["id", "category"])

df.show()

# StringIndexer <br>
StringIndexer encodes a string column of labels to a column of label indices. <br>
Four ordering options are supported (default = “frequencyDesc”): <br>
1. “frequencyDesc”: descending order by label frequency (most frequent label assigned 0), 
2. “frequencyAsc”: ascending order by label frequency (least frequent label assigned 0), 
3. “alphabetDesc”: descending alphabetical order, and 
4. “alphabetAsc”: ascending alphabetical order. <br>

Note that when labels have equal frequency under “frequencyDesc”/“frequencyAsc”, the strings are further sorted alphabetically.
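
To make the four orderings concrete, here is a minimal plain-Python sketch (not Spark's implementation) that reproduces the label-to-index mapping for the sample category column. The helper name label_order is hypothetical, and the tie-breaking rule is assumed to be ascending alphabetical order for both frequency-based options.

```python
from collections import Counter


def label_order(labels, order_type="frequencyDesc"):
    """Return the distinct labels in the order indices 0, 1, 2, ... are assigned."""
    counts = Counter(labels)
    if order_type == "frequencyDesc":
        # most frequent first; equal frequencies fall back to alphabetical order
        return sorted(counts, key=lambda s: (-counts[s], s))
    if order_type == "frequencyAsc":
        # least frequent first; equal frequencies fall back to alphabetical order
        return sorted(counts, key=lambda s: (counts[s], s))
    if order_type == "alphabetDesc":
        return sorted(counts, reverse=True)
    if order_type == "alphabetAsc":
        return sorted(counts)
    raise ValueError(f"unknown order type: {order_type}")


# same values as the "category" column of the sample DataFrame
categories = ["a", "b", "c", "a", "a", "c"]

for order_type in ("frequencyDesc", "frequencyAsc", "alphabetDesc", "alphabetAsc"):
    mapping = {label: float(i) for i, label in enumerate(label_order(categories, order_type))}
    print(order_type, mapping)
```

With “alphabetDesc”, for instance, c maps to 0.0, b to 1.0 and a to 2.0.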

In [None]:
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex", stringOrderType='alphabetDesc')
indexed = indexer.fit(df).transform(df)
indexed.show()

# OneHotEncoder <br>

One-hot encoding maps a categorical feature, represented as a label index, to a binary vector with at most a single one-value indicating the presence of a specific feature value from among the set of all feature values.

For string type input data, it is common to encode categorical features using StringIndexer first.
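
The phrase "at most a single one-value" reflects OneHotEncoder's dropLast parameter, which defaults to True: the last category index is encoded as an all-zeros vector, so n categories yield vectors of length n - 1. The plain-Python sketch below mimics that behaviour; the helper one_hot is hypothetical and is not part of Spark.

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a dense list of 0.0/1.0 values."""
    # with drop_last, the last category has no slot of its own
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector


# compare both settings for a feature with three categories (indices 0, 1, 2)
for idx in range(3):
    print(idx, one_hot(idx, 3), one_hot(idx, 3, drop_last=False))
```

With three categories and the default setting, index 2 becomes the all-zeros vector of length 2.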

In [None]:
from pyspark.ml.feature import OneHotEncoder

In [None]:
df = sc1.createDataFrame([
    (0.0, 1.0),
    (1.0, 0.0),
    (2.0, 1.0),
    (0.0, 2.0),
    (0.0, 1.0),
    (2.0, 0.0)
], ["categoryIndex1", "categoryIndex2"])

df.show()

In [None]:
encoder = OneHotEncoder(inputCols=["categoryIndex1", "categoryIndex2"],
                        outputCols=["categoryVec1", "categoryVec2"])
model = encoder.fit(df)
encoded = model.transform(df)


The output for each row is a sparse vector made up of 3 values. <br>
1. The first value is the length of the vector.
2. The second value is an array of the indices (positions) where non-zero entries are found.
3. The third value is an array of the values found at the indices listed in 2.

<br>
Example: (2, [1], [1.0]) denotes a vector of length 2 with the value 1.0 at index 1 (indices start at 0). Therefore, the one-hot vector is '01'.
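
As a small illustration of this notation, the plain-Python sketch below expands a (length, indices, values) triple into the full vector. The helper name to_dense is hypothetical; in PySpark itself, a vector's toArray() method serves the same purpose.

```python
def to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a dense list."""
    dense = [0.0] * size
    for position, value in zip(indices, values):
        dense[position] = value
    return dense


print(to_dense(2, [1], [1.0]))  # the example above: a one at index 1
print(to_dense(2, [], []))      # no non-zero entries: all zeros
```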

In [None]:
encoded.show()

In [None]:
encoder1 = OneHotEncoder(inputCol='categoryIndex', outputCol='One-hot-vector')
model1 = encoder1.fit(indexed)
encoded1 = model1.transform(indexed)

In [None]:
encoded1.show()

In [None]:
sc1.stop()

One-hot encoding is an essential step in preparing many datasets for machine learning modeling, and one of the most common steps in a feature pre-processing pipeline. It turns categorical data into a binary vector representation. Conceptually, this approach creates a new column for each unique value in the original category column.

In [None]:
import findspark
findspark.find()
findspark.init()
from pyspark.sql import SparkSession 
spark = SparkSession.builder.appName("One_Hot_Encoding").getOrCreate() 
df = spark.read.option("header", True).csv("sample.csv") 
df.show()

### Common PySpark implementation of One-Hot-Encoding
PySpark's implementation of one-hot encoding is fairly simple. It goes as follows:

- Convert the String Values to Numeric Labels/Indices
- One-Hot-Encode the Numeric Labels to a VectorUDT (pyspark.ml.linalg.VectorUDT)

In [None]:
#   ##  import the required libraries
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import OneHotEncoder

#   ##  numeric indexing for the strings (indexing starts from 0)
indexer = StringIndexer(inputCol="Color", outputCol="ColorNumericIndex")

#   ##  fit the indexer model and use it to transform the strings into numeric indices
df = indexer.fit(df).transform(df)

#   ##  one-hot-encoding the numeric indices
ohe = OneHotEncoder(inputCol="ColorNumericIndex", outputCol="ColorOHEVector")

#   ##  fit the ohe model and use it to transform the numeric indices into ohe vectors
df = ohe.fit(df).transform(df)

df.show()
#   ##  get datatype of the ohe vector column
print(df.schema["ColorOHEVector"].dataType)

#### Interpretable One Hot Encoding in PySpark
To create an interpretable one-hot encoding, we need to create a separate column for each distinct value. This is easily done using the PySpark DataFrame's built-in withColumn function, passing it a column computed by a UDF (user-defined function).

In [None]:
#   ##  import the required libraries
from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

Now, this is what we need to do:

- Gather all the distinct values in the column that needs to be one-hot-encoded
- For each of the gathered values, create a new column with a name in the format <<original column name>>_<<distinct value>>, representing the presence (1) or absence (0) of the distinct value in the record
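
Before turning to Spark, the two steps can be sketched on plain Python dictionaries. The sample rows below are made up for illustration (the contents of sample.csv are not shown here); only the column name Color comes from this lab.

```python
# hypothetical rows standing in for the DataFrame
rows = [{"Color": "Red"}, {"Color": "Blue"}, {"Color": "Red"}, {"Color": "Green"}]

# step 1: gather the distinct values of the column
distinct_values = sorted({row["Color"] for row in rows})

# step 2: add one presence (1) / absence (0) column per distinct value
encoded_rows = [
    {**row, **{f"Color_{value}": int(row["Color"] == value) for value in distinct_values}}
    for row in rows
]

for row in encoded_rows:
    print(row)
```

Each row ends up with exactly one of the new Color_ columns set to 1, which makes the encoding readable without decoding a vector.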

We used Pandas for creating the DataFrame. Can we do the same in a Spark-native way?

In [None]:
#   ##  gather the distinct values
distinct_values = df.select("Color")\
                    .distinct()\
                    .rdd\
                    .flatMap(lambda x: x).collect()

In [None]:
#   ##  for each of the gathered values create a new column 
for distinct_value in distinct_values:
    #   ##  bind the current value as a default argument so the lambda does not depend on the loop variable
    function = udf(lambda item, value=distinct_value: 
                   1 if item == value else 0, 
                   IntegerType())
    new_column_name = "Color"+'_'+distinct_value
    df = df.withColumn(new_column_name, function(col("Color")))

In [None]:
df.show()

In [None]:
spark.stop()