<a href="https://cocl.us/Data_Science_with_Scalla_top"><img src = "https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/SC0103EN/adds/Data_Science_with_Scalla_notebook_top.png" width = 750, align = "center"></a>
 <br/>
<a><img src="https://ibm.box.com/shared/static/ugcqz6ohbvff804xp84y4kqnvvk3bq1g.png" width="200" align="center"></a>

# Module 3: Feature Engineering - Categorical Features

## Categorical Features

### Lesson Objectives 

After completing this lesson, you should be able to: 

- encode categorical features with Spark's `StringIndexer`
- encode categorical features with Spark's `OneHotEncoder`
- know when to use each of these


### Motivation

- Categorical variables can take on only a limited number of possible values, such as country or gender
- They reflect how the data really is: there is no continuum of values between two countries, whereas there are infinitely many real numbers between any two integers
- Raw category labels are less convenient for computation than numbers, so software typically maps categorical values to integers internally
- In R you have factors
- In Python's pandas you have the categorical data type. What is the equivalent structure in Spark?
- These structures usually map strings to integers in a way that makes later computations easier. This lesson shows how Spark does it


### Why Are Integers Better?

- Spark's classifiers and regressors only work with numerical features; string features must be converted to numbers using a `StringIndexer`
- This helps keep Spark's internals simpler and more efficient
- There's little cost in transforming categorical features to numbers, and then back to strings
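
To make the round trip concrete, here is a minimal plain-Scala sketch (no Spark required) of a frequency-ordered mapping from strings to indices and back. It imitates the default behaviour of `StringIndexer`, which assigns index 0 to the most frequent label. The names `CategoryIndexSketch`, `fitLabels`, `encode` and `decode` are hypothetical, and the alphabetical tie-break is an assumption of this sketch rather than a statement about Spark.

```scala
// Plain-Scala sketch of a frequency-ordered label index.
// All names here are hypothetical and not part of Spark's API.
object CategoryIndexSketch {
  // Order labels by descending frequency. Ties are broken alphabetically,
  // which is an assumption made for this sketch.
  def fitLabels(values: Seq[String]): Vector[String] =
    values.groupBy(identity).toVector
      .map { case (label, occurrences) => (label, occurrences.size) }
      .sortBy { case (label, count) => (-count, label) }
      .map(_._1)

  // String -> index: the position of the label in the fitted ordering.
  // An unseen label yields -1.0 here; this sketch does not handle it further.
  def encode(labels: Vector[String], value: String): Double =
    labels.indexOf(value).toDouble

  // Index -> string: look the label up again by position.
  def decode(labels: Vector[String], index: Double): String =
    labels(index.toInt)
}

val nationalities = Seq("US", "UK", "FR", "US", "US", "FR")
val labels = CategoryIndexSketch.fitLabels(nationalities)                // Vector(US, FR, UK)
val indices = nationalities.map(CategoryIndexSketch.encode(labels, _))   // 0.0, 2.0, 1.0, 0.0, 0.0, 1.0
val restored = indices.map(CategoryIndexSketch.decode(labels, _))        // same as nationalities
```

Both directions are simple lookups once the label ordering is known, which is why converting to numbers and back is cheap.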

In [13]:
import $ivy.`org.apache.spark::spark-sql:2.4.0` // Or use any other 2.x version here
import $ivy.`org.apache.spark::spark-mllib:2.4.0` // Or use any other 2.x version here
import org.apache.spark.SparkContext
import org.apache.log4j.{Level, Logger}

// Silence Spark's verbose logging
Logger.getLogger("org").setLevel(Level.OFF)

// Local Spark context that uses all available cores
val sc = new SparkContext("local[*]", "Categorical Features")

sc: SparkContext = org.apache.spark.SparkContext@12af23da

In [14]:
import org.apache.spark.sql.SparkSession

// getOrCreate() reuses the SparkContext that is already running
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

spark: SparkSession = org.apache.spark.sql.SparkSession@57b13e49

In [15]:
// Using a StringIndexer

// Small example DataFrame with a categorical string column
val df = spark.createDataFrame(Seq((0, "US"), (1, "UK"), (2, "FR"), (3, "US"), (4, "US"), (5, "FR"))).toDF("id", "nationality")
df.show()

+---+-----------+
| id|nationality|
+---+-----------+
|  0|         US|
|  1|         UK|
|  2|         FR|
|  3|         US|
|  4|         US|
|  5|         FR|
+---+-----------+

df: org.apache.spark.sql.package.DataFrame = [id: int, nationality: string]

In [16]:
// Understanding the Output of a StringIndexer
import org.apache.spark.ml.feature.StringIndexer

// By default the most frequent label gets index 0.0 (here US: 3 rows, FR: 2, UK: 1)
val indexer = new StringIndexer().setInputCol("nationality").setOutputCol("nIndex")
val indexed = indexer.fit(df).transform(df)
indexed.show()

+---+-----------+------+
| id|nationality|nIndex|
+---+-----------+------+
|  0|         US|   0.0|
|  1|         UK|   2.0|
|  2|         FR|   1.0|
|  3|         US|   0.0|
|  4|         US|   0.0|
|  5|         FR|   1.0|
+---+-----------+------+

indexer: StringIndexer = strIdx_fecaeffdd8a1
indexed: org.apache.spark.sql.package.DataFrame = [id: int, nationality: string ... 1 more field]

### Reversing the Mapping 

- The classifiers in MLlib and spark.ml predict numeric values that correspond to the index values
- `IndexToString` transforms these numbers back into your original labels; unless labels are set explicitly, it reads them from the metadata that `StringIndexer` attached to the indexed column

In [17]:
// IndexToString Example

import org.apache.spark.ml.feature.IndexToString
val converter = new IndexToString().setInputCol("predictedIndex").setOutputCol("predictedNationality")

// Stand-in for model predictions: reuse the indexed column under a new name
val predictions = indexed.selectExpr("nIndex as predictedIndex")
converter.transform(predictions).show()

+--------------+--------------------+
|predictedIndex|predictedNationality|
+--------------+--------------------+
|           0.0|                  US|
|           2.0|                  UK|
|           1.0|                  FR|
|           0.0|                  US|
|           0.0|                  US|
|           1.0|                  FR|
+--------------+--------------------+

converter: IndexToString = idxToStr_1fca5d97b968
predictions: org.apache.spark.sql.package.DataFrame = [predictedIndex: double]

### OneHotEncoding

- Suppose we are trying to fit a linear regressor that uses nationality as a feature
- A single weight on the index column cannot treat the 3 nationalities in our dataset independently, because it imposes an artificial ordering and spacing on them
- It's better to instead have a separate Boolean feature for each nationality, and learn weights for those features independently
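
As a small worked illustration, assume the index encoding US = 0, FR = 1, UK = 2 and a linear model with intercept $b$ and weight $w$ (symbols introduced here only for this sketch). A single weight on the index $x$ gives

$$
\hat{y}(x) = b + w\,x
\quad\Rightarrow\quad
\hat{y}_{\mathrm{US}} = b,\qquad
\hat{y}_{\mathrm{FR}} = b + w,\qquad
\hat{y}_{\mathrm{UK}} = b + 2w
$$

so the prediction for FR is forced to lie exactly halfway between those for US and UK, whatever the data say. With one Boolean feature per nationality, each category gets its own weight:

$$
\hat{y} = b + w_{\mathrm{US}}\,x_{\mathrm{US}} + w_{\mathrm{FR}}\,x_{\mathrm{FR}} + w_{\mathrm{UK}}\,x_{\mathrm{UK}},
\qquad x_{\mathrm{US}} + x_{\mathrm{FR}} + x_{\mathrm{UK}} = 1
$$

Because exactly one indicator is 1 for each row, the three predictions can now be adjusted independently.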


### Spark's OneHotEncoder 

-	The `OneHotEncoder` creates a sparse vector column, with each dimension of this vector of Booleans representing one of the possible values of the original feature
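
The sparse notation can be read as `(size, [indices of non-zero entries], [values])`. The following plain-Scala sketch (no Spark required) mimics that representation for a single category index. `OneHotSketch`, `Sparse` and `encode` are hypothetical names chosen for illustration, and the `dropLast` parameter imitates the encoder option of the same name under the assumption that dropping the last category leaves it as the all-zeros vector.

```scala
// Plain-Scala sketch of one-hot encoding into a sparse (size, indices, values) triple.
// All names here are hypothetical and not part of Spark's API.
object OneHotSketch {
  final case class Sparse(size: Int, indices: Seq[Int], values: Seq[Double]) {
    // Expand to a dense vector: 0.0 everywhere except at the stored indices.
    def toDense: Vector[Double] =
      Vector.tabulate(size)(i => if (indices.contains(i)) values(indices.indexOf(i)) else 0.0)

    // Print in the same (size,[indices],[values]) layout used for sparse vectors.
    override def toString: String =
      s"($size,[${indices.mkString(",")}],[${values.mkString(",")}])"
  }

  // With dropLast = true the vector has one dimension fewer than there are
  // categories, and the last category is represented by all zeros.
  def encode(index: Int, numCategories: Int, dropLast: Boolean = true): Sparse = {
    require(index >= 0 && index < numCategories, "index out of range")
    val size = if (dropLast) numCategories - 1 else numCategories
    if (index < size) Sparse(size, Seq(index), Seq(1.0))
    else Sparse(size, Seq.empty, Seq.empty)
  }
}

val us = OneHotSketch.encode(0, 3)                        // (2,[0],[1.0])
val uk = OneHotSketch.encode(2, 3)                        // (2,[],[])
val ukKept = OneHotSketch.encode(2, 3, dropLast = false)  // (3,[2],[1.0])
```

Dropping the last category loses no information, since a row with all zeros can only belong to that category.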

In [18]:
// Using a OneHotEncoder
// Note: this is the Spark 2.x API, where OneHotEncoder is a plain transformer.
// From Spark 3.0 it is an estimator, so call encoder.fit(indexed) before transform.
import org.apache.spark.ml.feature.OneHotEncoder
val encoder = new OneHotEncoder().setInputCol("nIndex").setOutputCol("nVector")
val encoded = encoder.transform(indexed)

// dropLast defaults to true, so the last category (UK, index 2.0) becomes the all-zeros vector
encoded.show()

+---+-----------+------+-------------+
| id|nationality|nIndex|      nVector|
+---+-----------+------+-------------+
|  0|         US|   0.0|(2,[0],[1.0])|
|  1|         UK|   2.0|    (2,[],[])|
|  2|         FR|   1.0|(2,[1],[1.0])|
|  3|         US|   0.0|(2,[0],[1.0])|
|  4|         US|   0.0|(2,[0],[1.0])|
|  5|         FR|   1.0|(2,[1],[1.0])|
+---+-----------+------+-------------+

encoder: OneHotEncoder = oneHot_06c9d6bdc282
encoded: org.apache.spark.sql.package.DataFrame = [id: int, nationality: string ... 2 more fields]

In [19]:
// The dropLast Option
// Setting dropLast to false keeps one dimension per category, so UK gets its own slot

val encoder = new OneHotEncoder().setInputCol("nIndex").
  setOutputCol("nVector").setDropLast(false)

val encoded = encoder.transform(indexed)
encoded.show()

+---+-----------+------+-------------+
| id|nationality|nIndex|      nVector|
+---+-----------+------+-------------+
|  0|         US|   0.0|(3,[0],[1.0])|
|  1|         UK|   2.0|(3,[2],[1.0])|
|  2|         FR|   1.0|(3,[1],[1.0])|
|  3|         US|   0.0|(3,[0],[1.0])|
|  4|         US|   0.0|(3,[0],[1.0])|
|  5|         FR|   1.0|(3,[1],[1.0])|
+---+-----------+------+-------------+

encoder: OneHotEncoder = oneHot_9341a58a979f
encoded: org.apache.spark.sql.package.DataFrame = [id: int, nationality: string ... 2 more fields]

### Lesson Summary

Having completed this lesson, you should now be able to:

- encode categorical features with Spark's `StringIndexer`
- encode categorical features with Spark's `OneHotEncoder`
- know when to use each of these

### About the Authors

[Petro Verkhogliad](https://www.linkedin.com/in/vpetro) is a Consulting Manager at Lightbend. He holds a Master's degree in Computer Science with a specialization in Intelligent Systems. He is passionate about functional programming and applications of AI.

In [21]:
sc.stop()