# Lesson 27 - One-Hot Encoding

## Prepare Environment

In [0]:
import numpy as np
import matplotlib.pyplot as plt

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr

from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler 
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.classification import LogisticRegression 

spark = SparkSession.builder.getOrCreate()

## Introduction

A variable is referred to as **categorical**, **qualitative**, or **nominal** if it takes on values from a finite set of categories or classes. The values might be encoded using strings or numerical labels. If the values are encoded numerically, the numbers serve only as labels and do not have any numerical or quantitative significance. The possible values that a categorical variable can assume are referred to as its levels. We will often wish to include categorical features when creating machine learning models. Before we can do so, we must first encode these variables in a way that the training algorithms can understand. The most commonly used technique for encoding categorical variables is one-hot encoding. We will introduce this technique in this section.
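As a preview of the idea, one-hot encoding replaces each value of a categorical variable with an indicator vector that has one position per level: a 1 in the position of the observed level and 0 everywhere else. The following is a minimal sketch in plain Python, independent of Spark; the function name `one_hot` and the sample levels are hypothetical and chosen only for illustration.

```python
def one_hot(value, levels):
    """Return an indicator list with a 1 at the position of `value` in `levels`."""
    return [1 if value == level else 0 for level in levels]

# Hypothetical levels for a passenger-class variable.
levels = [1, 2, 3]

# Encode three illustrative observations.
encoded = [one_hot(v, levels) for v in [3, 1, 2]]
print(encoded)  # [[0, 0, 1], [1, 0, 0], [0, 1, 0]]
```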

## Load and Prepare Data

In this lesson, we will use the Titanic dataset to illustrate one-hot encoding. This dataset contains information about 887 passengers on the one and only voyage of the RMS Titanic. We are provided with the following information for each passenger: name, age, sex, passenger class, number of family members aboard, fare paid for the trip, and whether or not they survived the voyage. You can find more information about this dataset here: [Titanic dataset](https://www.kaggle.com/c/titanic/data).

In [0]:
titanic_schema = 'survived BYTE, pclass BYTE, name STRING, sex STRING, age FLOAT, ss_abd BYTE, pc_abd BYTE, fare FLOAT'

titanic = (
    spark.read
    .option('delimiter', '\t')
    .option('header', True)
    .schema(titanic_schema)
    .csv('/FileStore/tables/titanic.txt')
)
 
titanic.printSchema()

We will now display the first 10 rows of the dataset.

In [0]:
titanic.show(10, truncate=False)

In [0]:
N = titanic.count()
print(N)

### Distribution of Label Values

To serve as a baseline against which we can compare our model, we will check the distribution of the label values.

In [0]:
(
    titanic
    .select('survived')
    .groupby('survived')
    .agg(
        expr('COUNT(*) as count'), 
        expr(f'ROUND(COUNT(*)/{N},4) as prop')
    )
    .show()
)
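The label distribution gives a baseline because a model that always predicts the most common label achieves an accuracy equal to that label's proportion. The sketch below shows the calculation in plain Python; the helper `majority_baseline` and the toy labels are hypothetical and are not taken from the Titanic data.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)

# Toy labels for illustration only (not the Titanic labels).
toy_labels = [0, 0, 0, 1, 1]
label, baseline_acc = majority_baseline(toy_labels)
print(label, baseline_acc)  # 0 0.6
```

A trained model is only useful if its accuracy exceeds this baseline.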

### Identify Numerical and Categorical Features

Since we will need to process the categorical features using one-hot encoding, we will create lists specifying which features are numerical and which are categorical.

In [0]:
num_features = ['age', 'ss_abd', 'pc_abd']
cat_features = ['pclass', 'sex']

### Integer Encoding of Categorical Features

Before we can apply one-hot encoding to the categorical variables, we must first perform integer encoding of the variables. In the next cell, we use a `StringIndexer` object to perform this task on the `pclass` and `sex` columns. The new integer-encoded columns are named `pclass_ix` and `sex_ix`.

In [0]:
ix_features = ['pclass_ix', 'sex_ix']
indexer = StringIndexer(inputCols=cat_features, outputCols=ix_features).fit(titanic)
titanic_ix = indexer.transform(titanic) 
titanic_ix.show(10, truncate=False)
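By default, `StringIndexer` assigns indices in order of descending frequency, so the most common level receives index `0.0`. The sketch below imitates that ordering in plain Python. It is not Spark's implementation: the helper `fit_index` and the toy values are hypothetical, and breaking ties alphabetically is an assumption based on the behavior documented for recent Spark versions.

```python
from collections import Counter

def fit_index(values):
    """Map each level (as a string) to an index, most frequent level first."""
    counts = Counter(str(v) for v in values)
    # Sort by descending count; break ties alphabetically (assumed tie rule).
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    return {label: float(i) for i, label in enumerate(ordered)}

# Toy column for illustration only.
toy_sex = ['male', 'female', 'male', 'male', 'female']
mapping = fit_index(toy_sex)
print(mapping)  # {'male': 0.0, 'female': 1.0}
```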

### One-Hot Encoding

Next, we can use a `OneHotEncoder` object to perform one-hot encoding on the integer-encoded columns `pclass_ix` and `sex_ix`. The new one-hot encoded columns will be named `pclass_vec` and `sex_vec`.

In [0]:
vec_features = ['pclass_vec', 'sex_vec']

encoder = OneHotEncoder(
    inputCols=ix_features, 
    outputCols=vec_features,
    dropLast=False
).fit(titanic_ix)

titanic_enc = encoder.transform(titanic_ix) 
titanic_enc.select(cat_features + ix_features + vec_features).show(10, truncate=False)

# pclass_vec and sex_vec are shown in sparse format, a compact way to store vectors that contain many zeros.
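The `dropLast` parameter controls whether the last level gets its own position in the encoded vector. With `dropLast=False`, as used above, a variable with three levels produces vectors of length 3. With `dropLast=True` (Spark's default), the vector has length 2 and the last level is represented by all zeros. The following plain-Python sketch illustrates the difference; the helper `one_hot_index` is hypothetical and not part of Spark.

```python
def one_hot_index(index, num_levels, drop_last=False):
    """One-hot encode an integer index, optionally dropping the last level."""
    size = num_levels - 1 if drop_last else num_levels
    vec = [0.0] * size
    # When the last level is dropped, its index falls outside the vector,
    # so it is encoded as all zeros.
    if index < size:
        vec[index] = 1.0
    return vec

print(one_hot_index(2, 3))                  # [0.0, 0.0, 1.0]
print(one_hot_index(2, 3, drop_last=True))  # [0.0, 0.0]
print(one_hot_index(0, 3, drop_last=True))  # [1.0, 0.0]
```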

## Sparse Encoding

- The one-hot encoded columns are displayed as **sparse vectors**.

- **Dense Vector** ---> [7,0,0,3,0]
  - This is the standard way we think of a vector (or list). It is called a **dense encoding scheme**.

- **Sparse Vector** ---> (5, [0,3], [7,3])
  1. Length of the original vector
  2. Indices of the nonzero elements
  3. The nonzero values themselves
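The two representations carry the same information, which can be checked by converting between them. The sketch below does this in plain Python using the example from the list above; the helpers `to_dense` and `to_sparse` are hypothetical names, not Spark functions.

```python
def to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a dense list."""
    dense = [0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

def to_sparse(dense):
    """Compress a dense list into a (size, indices, values) triple."""
    indices = [i for i, v in enumerate(dense) if v != 0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values

print(to_dense(5, [0, 3], [7, 3]))  # [7, 0, 0, 3, 0]
print(to_sparse([7, 0, 0, 3, 0]))   # (5, [0, 3], [7, 3])
```

A one-hot vector has exactly one nonzero entry (or none), so the sparse form stays small no matter how many levels the variable has.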

### Assemble Features Vector

We will now combine the numerical columns with the one-hot encoded vector columns using a `VectorAssembler` to create a `features` column.

In [0]:
assembler = VectorAssembler(
    inputCols=num_features + vec_features, 
    outputCol='features'
)

train = assembler.transform(titanic_enc)
train.select(
    num_features + cat_features + vec_features + ['features']
).show(5, truncate=False)
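Conceptually, the assembler concatenates its inputs in the order given: scalar columns contribute one entry each, and vector columns contribute all of their entries. The following plain-Python sketch illustrates this with a hypothetical row; the helper `assemble` and the sample values are assumptions for illustration and do not reproduce Spark's internals.

```python
def assemble(*parts):
    """Concatenate scalars and vectors (lists or tuples) into one flat feature list."""
    features = []
    for part in parts:
        if isinstance(part, (list, tuple)):
            features.extend(float(x) for x in part)
        else:
            features.append(float(part))
    return features

# Hypothetical row: age, ss_abd, pc_abd, then one-hot vectors for pclass and sex.
row = assemble(30.0, 0, 0, [0.0, 0.0, 1.0], [1.0, 0.0])
print(row)  # [30.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0]
```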

## Logistic Regression

We are now ready to use our training data to create a logistic regression model.

### Training and Scoring Model

We will now train a logistic regression model and use it to generate predictions for the training data.

In [0]:
logreg = LogisticRegression(featuresCol='features', labelCol='survived')
logreg_model = logreg.fit(train)

train_pred = logreg_model.transform(train)
train_pred.select(['probability', 'prediction', 'survived']).show(16, truncate=False)

Next, we will evaluate our model by calculating its accuracy on the training set.

In [0]:
accuracy_eval = MulticlassClassificationEvaluator(
    predictionCol='prediction', labelCol='survived', metricName='accuracy'
)

acc = accuracy_eval.evaluate(train_pred)
print(acc)
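Accuracy is the proportion of observations for which the predicted label matches the true label. The plain-Python sketch below shows the calculation on toy values; the helper `accuracy` and the sample lists are hypothetical and are not output from the model above.

```python
def accuracy(predictions, labels):
    """Return the fraction of predictions that equal the corresponding labels."""
    if len(predictions) != len(labels):
        raise ValueError('predictions and labels must have the same length')
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Toy predictions and labels for illustration only.
toy_acc = accuracy([1, 0, 0, 1], [1, 0, 1, 1])
print(toy_acc)  # 0.75
```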

### Generating Predictions for New Data

We end the section by applying the model to new observations.

In [0]:
new_data = spark.createDataFrame(
    [[3, 'male', 30.0, 0, 0],
     [2, 'male', 30.0, 0, 0],
     [1, 'male', 30.0, 0, 0],
     [3, 'female', 30.0, 0, 0],
     [2, 'female', 30.0, 0, 0],
     [1, 'female', 30.0, 0, 0]],
    schema='pclass BYTE, sex STRING, age FLOAT, ss_abd BYTE, pc_abd BYTE'
)

Before we can provide the new observations to the model, we must first apply the encoding steps to the new data.

In [0]:
temp = indexer.transform(new_data)
temp = encoder.transform(temp)
temp = assembler.transform(temp)
pred = logreg_model.transform(temp)

pred.select('pclass', 'sex', 'probability', 'prediction').show(truncate=False)