# Categorical data

Categorical data is a collection of information that is divided into groups. For example, if an organisation or agency collects biodata on its employees, the resulting data is categorical, because it can be grouped by the variables in the biodata, such as sex or state of residence.

## Types of categorical data

Categorical data consists of values and observations that can be sorted into categories or groups. Bar graphs and pie charts are common ways to display it. There are two kinds of categorical data:

- Nominal data: categories with no inherent order (for example, state of residence)
- Ordinal data: categories with a meaningful order (for example, low, medium, high)
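
The difference between the two kinds can be sketched with pandas, which marks a categorical as ordered or unordered. This is only an illustration: the state names and education levels below are hypothetical values, not taken from the dataset used later.

```python
import pandas as pd

# Nominal: the categories have no inherent order (hypothetical states of residence).
state = pd.Categorical(["Lagos", "Kano", "Lagos", "Oyo"])

# Ordinal: the categories carry a meaningful order (hypothetical education levels).
education = pd.Categorical(
    ["Bachelor", "High school", "Master", "Bachelor"],
    categories=["High school", "Bachelor", "Master"],
    ordered=True,
)

print(state.ordered)              # False
print(education.ordered)          # True
# Ordered categoricals support min/max; the integer codes follow the declared order.
print(education.max())            # Master
print(education.codes.tolist())   # [1, 0, 2, 1]
```

Declaring the order explicitly matters: without `categories=...`, pandas would sort the labels alphabetically, which is rarely the intended ranking.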


#### Load and process data


In [3]:
import pandas as pd

In [4]:
dataset = pd.read_csv("./datasets/Data.csv")
dataset

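|   | Country | Age  | Salary  | Purchased |
|---|---------|------|---------|-----------|
| 0 | France  | 44.0 | 72000.0 | No        |
| 1 | Spain   | 27.0 | 48000.0 | Yes       |
| 2 | Germany | 30.0 | 54000.0 | No        |
| 3 | Spain   | 38.0 | 61000.0 | No        |
| 4 | Germany | 40.0 | NaN     | Yes       |
| 5 | France  | 35.0 | 58000.0 | Yes       |
| 6 | Spain   | NaN  | 52000.0 | No        |
| 7 | France  | 48.0 | 79000.0 | Yes       |
| 8 | Germany | 50.0 | 83000.0 | No        |
| 9 | France  | 37.0 | 67000.0 | Yes       |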


In [5]:
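# Features: every column except the last (Country, Age, Salary)
x = dataset.iloc[:,:-1].values
# Target: the last column (Purchased)
y = dataset.iloc[:,3].values

In [6]:
# Feature matrix
x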

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

In [7]:
# Target vector
y

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
      dtype=object)

#### Manage missing data


In [8]:
from sklearn.impute import SimpleImputer
import numpy as np

In [9]:
dataset['Salary'].isna().sum()

1

In [10]:
dataset['Salary']

0    72000.0
1    48000.0
2    54000.0
3    61000.0
4        NaN
5    58000.0
6    52000.0
7    79000.0
8    83000.0
9    67000.0
Name: Salary, dtype: float64

In [11]:
dataset['Age']

0    44.0
1    27.0
2    30.0
3    38.0
4    40.0
5    35.0
6     NaN
7    48.0
8    50.0
9    37.0
Name: Age, dtype: float64

In [12]:
# Configure the imputer: replace NaN with the column mean
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
# Fit the imputer on the numeric columns (Age, Salary)
imp = imp.fit(x[:,1:3])
# Overwrite those columns in place: all rows, columns 1 and 2 (the slice 1:3 excludes 3)
x[:,1:3] = imp.transform(x[:,1:3])
x

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
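
To see where the two filled-in values come from, the same mean imputation can be reproduced with pandas alone. This is a minimal sketch that rebuilds the `Age` and `Salary` columns by hand from the table above; the names `df`, `means` and `filled` are only for illustration.

```python
import numpy as np
import pandas as pd

# Age and Salary values copied from the dataset shown above.
df = pd.DataFrame({
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
})

# DataFrame.mean() skips NaN by default, so each mean uses the 9 known values.
means = df.mean()

# fillna with a Series fills each column with its own mean.
filled = df.fillna(means)

print(round(means["Age"], 2))      # 38.78
print(round(means["Salary"], 2))   # 63777.78
print(int(filled.isna().sum().sum()))  # 0
```

The means match the values that `SimpleImputer` wrote into rows 6 and 4 of `x`.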

#### Encode categorical data


In [13]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

In [14]:
# LabelEncoder version (kept for reference, not executed)
# from sklearn.preprocessing import LabelEncoder
#
# Label-encode column 0, 'Country'
# le1 = LabelEncoder()
# x[:,0] = le1.fit_transform(x[:,0])
# x

In [15]:
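# One-hot encode the 'Country' column (index 0) of x; pass Age and Salary through unchanged
country_col_transform = ColumnTransformer(
    [
        (
            'one_hot_encoder',
            OneHotEncoder(categories='auto'),
            [0]
        )
    ],
    remainder='passthrough'
)

# Casting to int64 truncates the imputed means (e.g. 63777.78 becomes 63777)
x = np.array(country_col_transform.fit_transform(x), dtype=np.int64)

# OneHotEncoder sorts the categories, so the columns are France, Germany, Spain
pd.DataFrame(x, columns=["France", "Germany", "Spain", "Age", "Salary"])

|   | France | Germany | Spain | Age | Salary |
|---|--------|---------|-------|-----|--------|
| 0 | 1      | 0       | 0     | 44  | 72000  |
| 1 | 0      | 0       | 1     | 27  | 48000  |
| 2 | 0      | 1       | 0     | 30  | 54000  |
| 3 | 0      | 0       | 1     | 38  | 61000  |
| 4 | 0      | 1       | 0     | 40  | 63777  |
| 5 | 1      | 0       | 0     | 35  | 58000  |
| 6 | 0      | 0       | 1     | 38  | 52000  |
| 7 | 1      | 0       | 0     | 48  | 79000  |
| 8 | 0      | 1       | 0     | 50  | 83000  |
| 9 | 1      | 0       | 0     | 37  | 67000  |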
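
With `categories='auto'`, `OneHotEncoder` orders its output columns by sorted category value, not by order of appearance in the data. Rather than typing the column labels by hand, the order can be read back from the fitted encoder. A minimal sketch on a small hand-made country column (the array `countries` is only for illustration):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A single column of country names, in the order they first appear in the data.
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]], dtype=object)

encoder = OneHotEncoder(categories="auto")
# fit_transform returns a sparse matrix by default; densify it for inspection.
encoded = encoder.fit_transform(countries).toarray()

# categories_ holds the learned categories in output-column order.
print(encoder.categories_[0].tolist())  # ['France', 'Germany', 'Spain']
# The 'Spain' row therefore has its 1 in the third column.
print(encoded[1].tolist())              # [0.0, 0.0, 1.0]
```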


In [16]:
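# One-hot encode the 'Purchased' target: encode y (not x), reshaped to a single column
purchased_col_transform = ColumnTransformer(
    [
        (
            'one_hot_encoder',
            OneHotEncoder(categories='auto'),
            [0]
        )
    ]
)

y = np.array(purchased_col_transform.fit_transform(y.reshape(-1, 1)), dtype=np.int64)
# OneHotEncoder sorts the categories, so the columns are No, Yes
pd.DataFrame(y, columns=["No", "Yes"])

|   | No | Yes |
|---|----|-----|
| 0 | 1  | 0   |
| 1 | 0  | 1   |
| 2 | 1  | 0   |
| 3 | 1  | 0   |
| 4 | 0  | 1   |
| 5 | 0  | 1   |
| 6 | 1  | 0   |
| 7 | 0  | 1   |
| 8 | 1  | 0   |
| 9 | 0  | 1   |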
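
For a two-class target such as `Purchased`, a single 0/1 column carries the same information as two one-hot columns. As an alternative sketch (not the approach used in this notebook), `LabelEncoder` produces that 1-D form directly; the array `purchased` below is copied from the `y` values shown earlier.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# 'Purchased' values copied from the dataset shown above.
purchased = np.array(
    ["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"],
    dtype=object,
)

le = LabelEncoder()
y_encoded = le.fit_transform(purchased)

# classes_ lists the labels in the order of their integer codes.
print(le.classes_.tolist())  # ['No', 'Yes']
print(y_encoded.tolist())    # [0, 1, 0, 0, 1, 1, 0, 1, 0, 1]

# inverse_transform maps the codes back to the original labels.
print(le.inverse_transform([1, 0]).tolist())  # ['Yes', 'No']
```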


### Split the dataset into training and testing sets

The goal is to build a model that generalizes well to new, unseen data. The testing set serves as a proxy for that new data.

- Training set: subset for training a model
- Testing set: subset for testing the trained model

Things to keep in mind

- The training subset should be large enough to yield statistically meaningful results
- Do not choose a testing subset with different characteristics than the training set

NOTE: DO NOT USE TESTING DATA FOR TRAINING. If you see surprisingly good results on your evaluation metrics, you may be accidentally training on the testing subset.

Example: 99% precision on both the training and testing subsets can occur when the same entries appear, duplicated, in both subsets.
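
One way to check for this kind of leakage is to look for rows that appear in both subsets. A minimal sketch, assuming both subsets are available as DataFrames with the same columns; the `train` and `test` frames below are hypothetical.

```python
import pandas as pd

# Hypothetical subsets; the row (38, 61000) appears in both.
train = pd.DataFrame({"Age": [44, 27, 38], "Salary": [72000, 48000, 61000]})
test = pd.DataFrame({"Age": [38, 50], "Salary": [61000, 83000]})

# An inner merge on all shared columns keeps only rows present in both subsets.
# drop_duplicates first, so repeated rows inside one subset are not counted twice.
overlap = train.drop_duplicates().merge(test.drop_duplicates(), how="inner")

print(len(overlap))  # 1
print(overlap.iloc[0].tolist())  # [38, 61000]
```

An empty `overlap` does not prove the split is clean (near-duplicates would slip through), but a non-empty one is a clear warning sign.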

In [17]:
from sklearn.model_selection import train_test_split

In [18]:
# x: feature matrix; y: target to predict
# test_size=0.2 holds out 20% of the rows; random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

In [19]:
# 80% for training
x_train

array([[    0,     1,     0,    40, 63777],
       [    1,     0,     0,    37, 67000],
       [    0,     0,     1,    27, 48000],
       [    0,     0,     1,    38, 52000],
       [    1,     0,     0,    48, 79000],
       [    0,     0,     1,    38, 61000],
       [    1,     0,     0,    44, 72000],
       [    1,     0,     0,    35, 58000]], dtype=int64)

In [24]:
y_train

array([[0, 1],
       [0, 1],
       [0, 1],
       [1, 0],
       [0, 1],
       [1, 0],
       [1, 0],
       [0, 1]], dtype=int64)

In [23]:
# 20% for testing
x_test

array([[    0,     1,     0,    30, 54000],
       [    0,     1,     0,    50, 83000]], dtype=int64)

In [22]:
y_test

array([[1, 0],
       [1, 0]], dtype=int64)
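
The effect of `test_size` and `random_state` can be checked on a small stand-in dataset. The sketch below uses hypothetical arrays `x_demo` and `y_demo` rather than the notebook's data. It also shows the optional `stratify` argument, which this notebook does not use but which can help keep class proportions similar between the two subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples, 2 features, balanced binary labels.
x_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

# test_size=0.2 on 10 samples gives 8 training rows and 2 testing rows.
a_train, a_test, b_train, b_test = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0
)
print(a_train.shape, a_test.shape)  # (8, 2) (2, 2)

# Reusing the same random_state reproduces the same split.
a_train2, a_test2, _, _ = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0
)
print(np.array_equal(a_test, a_test2))  # True

# stratify=y_demo keeps the class proportions: one sample of each class in the test set.
_, _, _, s_test = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)
print(sorted(s_test.tolist()))  # [0, 1]
```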