# Strings and Indexes - String Categorization

In this notebook, we'll see two techniques:

- `StringIndexer`: maps string categories to numeric indices
- `IndexToString`: maps the numeric indices back to the original strings

The idea of `StringIndexer` is to fit a model on one dataset and then use that model to transform other datasets.

However, a new dataset may contain a value that was not present in the data the model was fitted on. The `handleInvalid` parameter controls what the indexer does in that case:

- `'error'` (default): raise an exception
- `'skip'`: drop the rows that contain the unseen value
- `'keep'`: put the unseen value in an extra "unknown" index
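
The three options above can be illustrated without Spark. The following is a plain-Python sketch of the behavior, not Spark's implementation; the function name `index_values` and the sample values are assumptions made for illustration.

```python
def index_values(values, labels, handle_invalid="error"):
    """Mimic how a fitted indexer treats values, including unseen ones.

    `labels` plays the role of the fitted model: position i holds the
    string that maps to index i. Hypothetical helper, not a Spark API.
    """
    lookup = {label: float(i) for i, label in enumerate(labels)}
    indexed = []
    for value in values:
        if value in lookup:
            indexed.append((value, lookup[value]))
        elif handle_invalid == "error":
            # Default: fail as soon as an unseen label appears
            raise ValueError(f"Unseen label: {value}")
        elif handle_invalid == "skip":
            # Drop the row entirely
            continue
        elif handle_invalid == "keep":
            # Extra bucket placed right after the known labels
            indexed.append((value, float(len(labels))))
        else:
            raise ValueError(f"Unknown handle_invalid: {handle_invalid}")
    return indexed


fitted_labels = ["France", "Germany", "Spain"]
new_values = ["Spain", "Italy", "France"]  # "Italy" was never seen

print(index_values(new_values, fitted_labels, "skip"))
print(index_values(new_values, fitted_labels, "keep"))
```

With `"skip"` the `Italy` row disappears from the result, while with `"keep"` it receives the index one past the last known label.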

_An important thing to note is that it assigns the first index (0) to the most frequent category, the next index to the second most frequent, and so on._
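
This ordering rule can be reproduced in a few lines of plain Python. It is only a sketch of frequency-descending ordering: `fit_labels` is a hypothetical helper and the sample data is made up. Breaking ties alphabetically is an assumption here, not something this notebook demonstrates.

```python
from collections import Counter


def fit_labels(values):
    """Return labels sorted from most to least frequent (hypothetical helper)."""
    counts = Counter(values)
    # Sort by descending count; alphabetical order for ties is an assumption
    return sorted(counts, key=lambda label: (-counts[label], label))


sample = ["France", "Spain", "France", "Germany", "France", "Germany"]
labels = fit_labels(sample)

# The position of each label in the sorted list becomes its index
index_of = {label: float(i) for i, label in enumerate(labels)}

print(labels)
print(index_of)
```

In this sample `France` appears most often, so it takes index 0.0, followed by `Germany` and then `Spain`.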

## Importing

In [5]:
import findspark

# Locate the Spark installation and make pyspark importable before importing it
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stringindexer").getOrCreate()

In [6]:
from pyspark.ml.feature import StringIndexer, IndexToString

## Loading Data

In [7]:
churn = spark.read.load(
    "../../data/Churn.csv",
    format="csv",
    sep=";",
    header=True,
    inferSchema=True)

churn.show(2)

+-----------+---------+------+---+------+-------+-------------+---------+--------------+---------------+------+
|CreditScore|Geography|Gender|Age|Tenure|Balance|NumOfProducts|HasCrCard|IsActiveMember|EstimatedSalary|Exited|
+-----------+---------+------+---+------+-------+-------------+---------+--------------+---------------+------+
|        619|   France|Female| 42|     2|      0|            1|        1|             1|       10134888|     1|
|        608|    Spain|Female| 41|     1|8380786|            1|        0|             1|       11254258|     0|
+-----------+---------+------+---+------+-------+-------------+---------+--------------+---------------+------+
only showing top 2 rows



## Using StringIndexer

Here we convert a string column, `Geography`, into a numeric column.

We pass `StringIndexer` the input column and the name we want for the output column.

Like other Spark ML techniques, it works similarly to sklearn classes: first we fit the model, and then we apply the transformation.

In [8]:
geo_index = StringIndexer(
    inputCol="Geography",
    outputCol="geo_index"
)

indexer_model = geo_index.fit(churn)
churn = indexer_model.transform(churn)

In [9]:
churn.select("Geography", "geo_index").show(truncate=False)

+---------+---------+
|Geography|geo_index|
+---------+---------+
|France   |0.0      |
|Spain    |2.0      |
|France   |0.0      |
|France   |0.0      |
|Spain    |2.0      |
|Spain    |2.0      |
|France   |0.0      |
|Germany  |1.0      |
|France   |0.0      |
|France   |0.0      |
|France   |0.0      |
|Spain    |2.0      |
|France   |0.0      |
|France   |0.0      |
|Spain    |2.0      |
|Germany  |1.0      |
|Germany  |1.0      |
|Spain    |2.0      |
|Spain    |2.0      |
|France   |0.0      |
+---------+---------+
only showing top 20 rows



## Using IndexToString

This one works much like `StringIndexer`, but we don't have to fit it: we just apply the transformation to get the categories back. By default, it reads the labels from the metadata that `StringIndexer` attached to the input column.

In [11]:
geo_original = IndexToString(
    inputCol="geo_index",
    outputCol="geo_original"
)

churn = geo_original.transform(churn)

In [12]:
churn.select("Geography","geo_index", "geo_original").show(truncate=False)

+---------+---------+------------+
|Geography|geo_index|geo_original|
+---------+---------+------------+
|France   |0.0      |France      |
|Spain    |2.0      |Spain       |
|France   |0.0      |France      |
|France   |0.0      |France      |
|Spain    |2.0      |Spain       |
|Spain    |2.0      |Spain       |
|France   |0.0      |France      |
|Germany  |1.0      |Germany     |
|France   |0.0      |France      |
|France   |0.0      |France      |
|France   |0.0      |France      |
|Spain    |2.0      |Spain       |
|France   |0.0      |France      |
|France   |0.0      |France      |
|Spain    |2.0      |Spain       |
|Germany  |1.0      |Germany     |
|Germany  |1.0      |Germany     |
|Spain    |2.0      |Spain       |
|Spain    |2.0      |Spain       |
|France   |0.0      |France      |
+---------+---------+------------+
only showing top 20 rows
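
Why no fitting is needed can be seen in a plain-Python sketch: once the ordered label list exists, the inverse mapping is only a position lookup. The label order below is taken from the output above; `index_to_string` is a hypothetical helper, not a Spark function.

```python
def index_to_string(indices, labels):
    """Map each numeric index back to the label stored at that position."""
    return [labels[int(index)] for index in indices]


# Label order as shown in the geo_index column: 0.0, 1.0, 2.0
labels = ["France", "Germany", "Spain"]
geo_index = [0.0, 2.0, 0.0, 1.0]

print(index_to_string(geo_index, labels))
```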

