## Encoding categorical features
Categorical features are non-numeric features that take one of a limited number of possible values.

Before feeding them into most machine learning algorithms, we must convert them into numerical features using either:
- **Label encoding:** Each unique category is assigned a numerical value: 0, 1, 2, 3, ...
- **One-hot encoding:** A new binary feature is created for each category.
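
To make the difference concrete, here is a minimal sketch of both encodings applied to the same toy column. The `colors` series is made up for illustration, and `pd.factorize` is just one of several ways to produce label codes.

```python
import pandas as pd

# Toy column, invented for illustration
colors = pd.Series(["red", "green", "blue", "green"], name="color")

# Label encoding: one integer per category, numbered in order of first appearance
codes, uniques = pd.factorize(colors)
label_encoded = pd.Series(codes, name="color")

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(colors, prefix="color", dtype=int)

print(label_encoded.tolist())
print(one_hot)
```

Label encoding keeps a single column, while one-hot encoding produces as many columns as there are categories.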

In [33]:
import pandas as pd
import seaborn as sns

In [34]:
titanic = sns.load_dataset("titanic")

In [35]:
cat_columns = [col for col in titanic.columns if titanic[col].dtype in ["object", "category"]]
cat_columns

['sex', 'embarked', 'class', 'who', 'deck', 'embark_town', 'alive']

In [36]:
df_categories = titanic[cat_columns]
df_categories.head()

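| | sex | embarked | class | who | deck | embark_town | alive |
|---|---|---|---|---|---|---|---|
| 0 | male | S | Third | man | NaN | Southampton | no |
| 1 | female | C | First | woman | C | Cherbourg | yes |
| 2 | female | S | Third | woman | NaN | Southampton | yes |
| 3 | female | S | First | woman | C | Southampton | yes |
| 4 | male | S | Third | man | NaN | Southampton | no |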


In [37]:
for cat in cat_columns:
    print(f"{cat}: {df_categories[cat].unique()}")

sex: ['male' 'female']
embarked: ['S' 'C' 'Q' nan]
class: ['Third', 'First', 'Second']
Categories (3, object): ['First', 'Second', 'Third']
who: ['man' 'woman' 'child']
deck: [NaN, 'C', 'E', 'G', 'D', 'A', 'B', 'F']
Categories (7, object): ['A', 'B', 'C', 'D', 'E', 'F', 'G']
embark_town: ['Southampton' 'Cherbourg' 'Queenstown' nan]
alive: ['no' 'yes']


### Label encoding
Label encoding maps each category to a numerical value.

Use label encoding if:
- Categories have a natural order
- There are only 2 categories
- Using one-hot encoding would lead to a large number of features
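
When the categories have a natural order, the integer codes should follow it. The sketch below shows one possible way to do this with an ordered `pd.Categorical`; the toy DataFrame and the `class_encoded` column name are assumptions for illustration.

```python
import pandas as pd

# Toy data mirroring the values of the `class` column
df = pd.DataFrame({"class": ["Third", "First", "Second", "First"]})

# Declare the order explicitly so the integer codes respect it
class_order = ["First", "Second", "Third"]
ordered = pd.Categorical(df["class"], categories=class_order, ordered=True)

# .codes gives the position of each value in `class_order`
df["class_encoded"] = ordered.codes

print(df)
```

Spelling out the order avoids relying on alphabetical sorting, which only coincidentally matches the intended order here.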

In [38]:
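# Manual mapping (keys must match the category values exactly, including case;
# values without a matching key, such as NaN, become NaN)
titanic["embarked"] = titanic["embarked"].map({"S": 0, "C": 1, "Q": 2})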

### Auto-mapping using LabelEncoder from scikit-learn

In [39]:
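from sklearn.preprocessing import LabelEncoder

# Note: scikit-learn documents LabelEncoder as a tool for target labels (y);
# OrdinalEncoder is its counterpart for feature columns.
le = LabelEncoder()

# Codes follow the sorted order of the unique values in the column
titanic["deck"] = le.fit_transform(titanic["deck"])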
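
To check which integer was assigned to which category, the fitted encoder can be inspected. This is a small sketch on a made-up list of deck letters; `le_demo`, `decks`, and `mapping` are illustrative names, not part of the notebook above.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up deck letters, for illustration only
decks = ["C", "A", "B", "A"]

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(decks)

# classes_ holds the sorted unique labels; the index of a label is its code
mapping = {label: code for code, label in enumerate(le_demo.classes_)}

# inverse_transform recovers the original labels from the codes
decoded = le_demo.inverse_transform(encoded)

print(mapping)
print(list(decoded))
```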

### One-hot encoding
In one-hot encoding, a new binary feature is created for each category, and the value of that feature is set to 1 if the observation belongs to that category, and 0 otherwise.

Use one-hot encoding if:
- Categories have no natural order.
- The number of categories is small (but more than 2)
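
As a rough illustration of what this does to a DataFrame, the sketch below one-hot encodes a single column and leaves the others untouched. The toy data is invented, and `dtype=int` is passed only so the output shows 0/1 rather than booleans.

```python
import pandas as pd

# Toy data, invented for illustration
df = pd.DataFrame({
    "who": ["man", "woman", "child", "woman"],
    "fare": [7.25, 71.28, 21.08, 53.10],
})

# The `who` column is replaced by one binary column per category
encoded = pd.get_dummies(data=df, columns=["who"], dtype=int)

print(encoded)
```

Each row has exactly one 1 among the new `who_*` columns.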

In [41]:
titanic.head()
titanic.to_json("../Data/titanic_encoded.json", orient="records")

In [42]:
titanic = pd.get_dummies(data=titanic, columns=["who"])

### Auto-map remaining non-numerical features

In [31]:
cat_columns = [col for col in titanic.columns if titanic[col].dtype in ['object', 'category', 'bool']]
cat_columns

['sex', 'class', 'adult_male', 'embark_town', 'alive', 'alone']
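
One possible way to finish this step is to loop over the remaining columns and replace each with integer codes. The sketch below uses a toy DataFrame (values made up) and swaps in `select_dtypes` and `pd.factorize` as assumptions; it is not necessarily how the notebook continues.

```python
import pandas as pd

# Toy stand-in for the remaining columns; values are made up
df = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "alone": [False, True, False],
    "age": [22.0, 38.0, 26.0],
})

# Everything that is not numeric (object, category, bool, ...)
remaining = df.select_dtypes(exclude="number").columns

for col in remaining:
    # factorize numbers categories in order of first appearance;
    # missing values would be coded as -1
    codes, _ = pd.factorize(df[col])
    df[col] = codes

print(df)
```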