In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, MinMaxScaler, StandardScaler

# Data Types: Characteristics and Encoding Techniques

## 1. Qualitative (Categorical) Data/Features

### 1.1 Nominal Features:
- **Definition**: Nominal features are categorical variables that have no inherent order or ranking among the categories.
- **Characteristics**:
  - The categories are **distinct** and **unordered**.
  - They are simply **labels** or **names** for different groups.
  - You **cannot perform meaningful mathematical operations** on them (e.g., adding, averaging, or ranking them); only equality checks make sense.
- **Examples**:
  - **Gender**: Male, Female
  - **Colors**: Red, Blue, Green
  - **Countries**: USA, UK, Germany
- **Common Encoding Techniques**:
  - **One-Hot Encoding**: Transforms nominal features into binary columns for each category.
  - **Label Encoding**: Assigns a unique integer to each category. The integers carry no meaning of their own, so this is mostly used with tree-based algorithms.

- **When to Use**:
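  - Use **One-Hot Encoding** for nominal data in linear models so the model does not read a spurious numeric relationship into the categories.
  - **Label Encoding** is usually acceptable for tree-based models, which split on thresholds and are less sensitive to the arbitrary integer order than linear models.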

### 1.2 Ordinal Features:
- **Definition**: Ordinal features are categorical variables that have a clear, defined order or ranking among the categories.
- **Characteristics**:
  - The categories are **ordered**, but the **differences between categories** are not meaningful.
  - You know the rank of the categories but **cannot perform arithmetic** operations like subtraction or addition on them.
- **Examples**:
  - **Education Level**: High School, Bachelor’s, Master’s, PhD
  - **Satisfaction Rating**: Low, Medium, High
- **Common Encoding Techniques**:
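  - **Label Encoding**: Assigns one integer per category. Note that scikit-learn's `LabelEncoder` assigns the integers in sorted (alphabetical) order, which may not match the semantic ranking.
  - **Ordinal Encoding**: Explicitly encodes categories with their inherent order, supplied as a ranked list of categories.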

- **When to Use**:
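  - Use **Ordinal Encoding** with an explicitly specified category order for ordinal data, especially in tree-based models. **Label Encoding** works only if the integers it assigns happen to match the ranking (scikit-learn's `LabelEncoder` sorts categories alphabetically). Avoid **One-Hot Encoding** for ordinal data, as it discards the order.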

### 1.3 Binary Features:
- **Definition**: Binary features are categorical variables with only two distinct values.
- **Characteristics**:
  - Two possible outcomes or categories.
  - Categories are **mutually exclusive** and **exhaustive**.
- **Examples**:
  - **Gender**: Male (0), Female (1)
  - **Yes/No**: Yes (1), No (0)
- **Common Encoding Techniques**:
  - **Binary Encoding (0/1)**: Simple and effective for most machine learning algorithms.

- **When to Use**:
  - Suitable for all models as binary encoding is straightforward and interpretable.

### 1.4 Dichotomous Features:
- **Definition**: A specific type of binary feature, often representing **presence/absence**.
- **Characteristics**:
  - Two categories, often representing opposing states.
- **Examples**:
  - **Presence/Absence** of a disease.
  - **Pass/Fail** in an exam.
- **Common Encoding Techniques**:
  - **Binary Encoding (0/1)**: Encodes absence as `0` and presence as `1`.

### Summary Table

| Type of Categorical Data | Example                 | Key Characteristics                   | Common Encoding Techniques           | When to Use                                                                 |
|--------------------------|-------------------------|----------------------------------------|--------------------------------------|-----------------------------------------------------------------------------|
| **Nominal**              | Color (Red, Blue, Green) | No order, distinct categories          | **One-Hot Encoding**, **Label Encoding** | - Use **One-Hot Encoding** for **nominal** data in linear models like Logistic Regression or SVM to avoid implying order.     |
| **Ordinal**              | Education level          | Ordered, but differences aren't meaningful | **Ordinal Encoding** (with an explicit category order) | - Use **Ordinal Encoding** with the ranking spelled out so the integers preserve the order.<br>- Plain **Label Encoding** assigns integers in sorted (alphabetical) order in scikit-learn, so check that the codes match the intended ranking before relying on them. |
| **Binary**               | Yes/No, Male/Female      | Only two distinct categories           | **Binary Encoding** (0/1)            | - Use **Binary Encoding** for two categories, commonly represented as 0 and 1. Suitable for all models including linear models and tree-based models.                                |
| **Dichotomous**          | Pass/Fail                | Same as binary, often used in social/biological fields | **Binary Encoding** (0/1)            | - Identical treatment to binary data.              |


### Encoding Examples

#### OneHotEncoder

Note that One-Hot Encoding increases the dimensionality of the dataset, which can lead to higher memory usage and potentially slow down model training for datasets with many unique categories.

In [2]:
# Create DataFrame
df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})
df

Unnamed: 0,Color
0,Red
1,Blue
2,Green
3,Blue
4,Red


In [3]:
# Create One-Hot Encoding instance
one_hot_encoder = OneHotEncoder(sparse_output=False)

# Apply the encoder
encoded_df = pd.DataFrame(
    one_hot_encoder.fit_transform(df[['Color']]),
    columns=one_hot_encoder.get_feature_names_out(['Color'])
)

encoded_df

Unnamed: 0,Color_Blue,Color_Green,Color_Red
0,0.0,0.0,1.0
1,1.0,0.0,0.0
2,0.0,1.0,0.0
3,1.0,0.0,0.0
4,0.0,0.0,1.0
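
The dimensionality note above can be made concrete with a column that has many distinct values. The sketch below is illustrative only: the `UserID` column and its 200 distinct values are made up, and it relies on `OneHotEncoder` returning a SciPy sparse matrix by default, which stores only the non-zero entries.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical high-cardinality column: 1,000 rows, 200 distinct IDs
df_ids = pd.DataFrame({'UserID': [f'user_{i % 200}' for i in range(1000)]})

# Leaving the output sparse (the default) avoids materializing all the zeros
sparse_encoder = OneHotEncoder()
encoded_sparse = sparse_encoder.fit_transform(df_ids[['UserID']])

n_rows, n_cols = encoded_sparse.shape
print(n_rows, n_cols)      # one column per distinct category
print(encoded_sparse.nnz)  # number of stored non-zero entries (one per row)
```

A dense version of the same matrix would hold `n_rows * n_cols` values, almost all of them zero.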


#### Label Encoder

In [4]:
# Create DataFrame
df = pd.DataFrame({'Education': ['High School', 'Bachelor\'s', 'Master\'s', 'PhD', 'Bachelor\'s']})
df

Unnamed: 0,Education
0,High School
1,Bachelor's
2,Master's
3,PhD
4,Bachelor's


In [5]:
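# Create a LabelEncoder instance
label_encoder = LabelEncoder()

# Apply the encoder. LabelEncoder sorts the classes alphabetically, so the
# codes below do not follow the High School < Bachelor's < Master's < PhD ranking.
df['Education_encoded'] = label_encoder.fit_transform(df['Education'])

# Display the encoded DataFrame
df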

Unnamed: 0,Education,Education_encoded
0,High School,1
1,Bachelor's,0
2,Master's,2
3,PhD,3
4,Bachelor's,0
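
In the output above, `Bachelor's` receives 0 and `High School` receives 1, because `LabelEncoder` sorts categories alphabetically. When the ranking matters, one option is scikit-learn's `OrdinalEncoder` with the order passed explicitly. This is a minimal sketch; the names `df_edu` and `education_order` are introduced here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df_edu = pd.DataFrame(
    {'Education': ['High School', "Bachelor's", "Master's", 'PhD', "Bachelor's"]}
)

# Spell out the ranking from lowest to highest
education_order = ['High School', "Bachelor's", "Master's", 'PhD']

# `categories` takes one list per input column
ordinal_encoder = OrdinalEncoder(categories=[education_order])
encoded = ordinal_encoder.fit_transform(df_edu[['Education']])

# The encoder returns a 2D float array; flatten it and cast to int
df_edu['Education_encoded'] = encoded.ravel().astype(int)
print(df_edu)
```

With this ordering, High School maps to 0, Bachelor's to 1, Master's to 2, and PhD to 3.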


#### Binary Encoding

Binary Encoding is typically used for categorical variables that have only two distinct values (binary features). It is the simplest form of encoding, where you map one category to 0 and the other to 1.

We can perform binary encoding with LabelEncoder, but an explicit `map` makes it clear which category becomes 0 and which becomes 1, rather than leaving that to alphabetical order.

In [6]:
# Sample DataFrame with Binary Feature
df = pd.DataFrame({
    'Subscribed': ['Yes', 'No', 'Yes', 'No', 'Yes']  # Binary feature
})

# Manually apply Binary Encoding (0 for 'No', 1 for 'Yes')
df['Subscribed_encoded'] = df['Subscribed'].map({'Yes': 1, 'No': 0})
df

Unnamed: 0,Subscribed,Subscribed_encoded
0,Yes,1
1,No,0
2,Yes,1
3,No,0
4,Yes,1
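
One thing to watch with `map`: any value missing from the dictionary becomes `NaN` rather than raising an error. The sketch below uses a made-up column with an inconsistent spelling (`'yes'`) to show one way such values might be detected before training.

```python
import pandas as pd

# Hypothetical column with one inconsistently spelled value
df_sub = pd.DataFrame({'Subscribed': ['Yes', 'No', 'yes', 'Yes']})

mapping = {'Yes': 1, 'No': 0}
encoded = df_sub['Subscribed'].map(mapping)

# Values not present in the mapping come back as NaN
unmapped = df_sub.loc[encoded.isna(), 'Subscribed'].unique()
print(encoded.isna().sum())
print(unmapped)
```

Normalizing the text first (for example with `.str.strip().str.capitalize()`) is one possible remedy, depending on the data.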


## 2. Quantitative (Numerical) Data/Features:

### 2.1 Discrete Features:
- **Definition**: Discrete features are numerical variables that represent countable values.
- **Characteristics**:
  - These variables are **countable** and cannot take fractional values.
- **Examples**:
  - **Number of students**: 0, 1, 2, ...
  - **Number of cars**: 1, 2, 3, ...
- **Common Handling Techniques**:
  - Discrete features are often used **as-is** in most models.
  - If a discrete variable has a small number of unique values, it might be treated as a categorical variable, and techniques like **one-hot encoding** could be applied.

- **When to Use**:
  - Use as-is in most models; consider encoding if the number of unique values is small.
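
As a rough illustration of treating a low-cardinality discrete feature as categorical, the sketch below one-hot encodes a made-up `NumCars` column with `pd.get_dummies`. Whether this helps depends on the model and the data; it is an option, not a rule.

```python
import pandas as pd

# Hypothetical discrete feature with only three distinct values
df_cars = pd.DataFrame({'NumCars': [0, 1, 2, 1, 0]})

# One indicator column per distinct count
dummies = pd.get_dummies(df_cars['NumCars'], prefix='NumCars')
print(dummies)
```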

### 2.2 Continuous Features:
- **Definition**: Continuous features are numerical variables that can take any value within a range, including decimals.
- **Characteristics**:
  - These variables can take infinitely many values within a specific range.
- **Examples**:
  - **Height**: 170.5 cm, 180.2 cm
  - **Weight**: 70.3 kg, 80.1 kg
  - **Temperature**: 36.7°C
- **Common Transformation Techniques**:
  - **Normalization**: Scales continuous data between a specific range (e.g., 0 to 1).
  - **Standardization**: Centers the data around the mean with a standard deviation of 1.

- **When to Use**:
  - Use normalization or standardization in algorithms sensitive to the scale of the input data (e.g., SVM, k-NN).
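
To see why scale matters for distance-based models such as k-NN, consider the sketch below. The three rows are invented for illustration: row 1 differs from row 0 mainly in height, row 2 mainly in salary.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: [height in cm, salary in $]
X = np.array([[150.0, 30000.0],
              [180.0, 30500.0],
              [151.0, 50000.0]])

# Raw Euclidean distances from row 0: the salary column dominates
raw_d01 = np.linalg.norm(X[0] - X[1])
raw_d02 = np.linalg.norm(X[0] - X[2])

# After standardization both features contribute on a comparable scale
X_scaled = StandardScaler().fit_transform(X)
scaled_d01 = np.linalg.norm(X_scaled[0] - X_scaled[1])
scaled_d02 = np.linalg.norm(X_scaled[0] - X_scaled[2])

print(raw_d01, raw_d02)
print(scaled_d01, scaled_d02)
```

On the raw data the 30 cm height gap barely registers next to the salary gap; after scaling the two distances are of similar size.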

### 2.3 Interval Features:
- **Definition**: Interval features are numerical variables where the differences between values are meaningful, but there is **no true zero point**.
- **Characteristics**:
  - Intervals between values are **consistent** and **meaningful**.
  - There is **no absolute zero** that indicates the complete absence of the variable.
  - You can perform operations like addition and subtraction, but not meaningful multiplication or division.
- **Examples**:
  - **Temperature in Celsius or Fahrenheit**: The difference between 20°C and 30°C is meaningful, but 0°C does not mean "no temperature."
  - **Dates (Years)**: The difference between the years 2000 and 2010 is meaningful, but year 0 does not represent "no time."
- **Common Transformation Techniques**:
  - **Normalization**: Scale the values to a specific range (e.g., 0 to 1).
  - **Standardization**: Centers the data around the mean with a standard deviation of 1.

- **When to Use**:
  - Apply transformations like normalization or standardization as with continuous data; understanding the data scale helps in interpreting results.
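
The "no meaningful ratios" point can be checked with a little arithmetic. The sketch below compares 10°C and 20°C, then repeats the ratio in Kelvin, which has a true zero. The variable names are chosen here for illustration.

```python
celsius_a, celsius_b = 10.0, 20.0

# Differences are meaningful on an interval scale
difference = celsius_b - celsius_a

# The Celsius ratio suggests "twice as hot", which is not physically meaningful
naive_ratio = celsius_b / celsius_a

# Kelvin is a ratio scale (0 K is a true zero), so this ratio is meaningful
kelvin_ratio = (celsius_b + 273.15) / (celsius_a + 273.15)

print(difference, naive_ratio, round(kelvin_ratio, 4))
```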

### 2.4 Ratio Features:
- **Definition**: Ratio features are numerical variables similar to interval data, but with a **true zero point**, allowing for meaningful ratios between values.
- **Characteristics**:
  - The differences between values are meaningful, and the **zero point** represents the absence of the variable.
  - You can perform all arithmetic operations, including multiplication and division.
- **Examples**:
  - **Weight**: 0 kg means no weight.
  - **Height**: 0 cm means no height.
  - **Salary**: $0 means no salary.
- **Common Transformation Techniques**:
  - **Normalization**: Scale values to a specific range.
  - **Log Transformation**: Useful for handling skewed distributions.

- **When to Use**:
  - Normalize or apply log transformations for skewed distributions in any model sensitive to data distribution.

**Note**: In practical data preprocessing for machine learning, the distinction between interval and ratio scales often doesn't change how we handle the data. Both can be subjected to the same transformations (normalization, standardization). While the theoretical differences are important in statistics, they may not significantly impact the preprocessing steps in machine learning workflows.

### Summary Table

| Type of Quantitative Data | Examples                                         | Key Characteristics                                           | Common Transformation Techniques          | When to Use                                                                                             |
|---------------------------|--------------------------------------------------|----------------------------------------------------------------|--------------------------------------------|---------------------------------------------------------------------------------------------------------|
| **Discrete**              | Number of students, Number of cars               | Countable values, no fractions                                 | Often used **as-is**; consider **encoding** if necessary | Use as-is in most models; consider encoding if the number of unique values is small.                    |
| **Continuous**            | Height, Weight, Temperature                      | Infinite values within a range, can take decimal values        | **Normalization**, **Standardization**     | Use normalization or standardization in models sensitive to scale (e.g., SVM, k-NN).                    |
| **Interval**              | Dates (Years), Temperature (Celsius/Fahrenheit)  | Meaningful differences between values, no true zero point      | **Normalization**, **Standardization**     | Apply transformations as with continuous data; understanding scale helps in interpreting results.       |
| **Ratio**                 | Weight, Height, Salary                           | Meaningful differences with a true zero point, allows ratios   | **Normalization**, **Log Transformation**  | Apply transformations for skewed data; useful in any model sensitive to data distribution.              |


### Transformation Examples

#### Normalization (Min-Max Scaling)
**Example**:
Let's say we have a dataset of students' heights:
- Original data: `[150 cm, 160 cm, 170 cm, 180 cm]`
  
To normalize the data to a range of `[0, 1]` using the Min-Max normalization formula:
$$
x_{\text{norm}} = \frac{x - \text{min}(X)}{\text{max}(X) - \text{min}(X)}
$$

Applying the formula:
- Min value: `150`
- Max value: `180`

- For 150: $(150 - 150) / (180 - 150) = 0$
- For 160: $(160 - 150) / (180 - 150) = 0.333$
- For 170: $(170 - 150) / (180 - 150) = 0.667$
- For 180: $(180 - 150) / (180 - 150) = 1$

**Use Case**:
- This is useful for models like **SVM** or **k-NN**, which are sensitive to feature scaling.


In [7]:
# Create DataFrame
df = pd.DataFrame({'Height(cm)': [150, 160, 170, 180]})

# Apply Min-Max Normalization
scaler = MinMaxScaler()
df['Height(cm)_norm'] = scaler.fit_transform(df[['Height(cm)']])
print(df)


   Height(cm)  Height(cm)_norm
0         150         0.000000
1         160         0.333333
2         170         0.666667
3         180         1.000000
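
In practice the scaler is usually fitted once and then reused on new data. The sketch below assumes a hypothetical `new` DataFrame with heights not seen during fitting; it also recomputes the training values by hand with the formula above as a sanity check.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

train = pd.DataFrame({'Height(cm)': [150, 160, 170, 180]})
new = pd.DataFrame({'Height(cm)': [165, 190]})  # hypothetical unseen values

# Fit on the training data only: min = 150, max = 180
scaler = MinMaxScaler().fit(train)

# Manual application of (x - min) / (max - min)
manual = (train['Height(cm)'] - 150) / (180 - 150)
print(np.allclose(scaler.transform(train).ravel(), manual))

# New values reuse the training min and max, so 190 lands above 1
new_scaled = scaler.transform(new).ravel()
print(new_scaled)
```

Values outside the training range fall outside `[0, 1]`, which is worth keeping in mind when the model expects bounded inputs.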


#### Standardization (Z-score Scaling)
**Example**:
Let's say we have height data:
- Original data: `[150 cm, 160 cm, 170 cm, 180 cm]`

To standardize the data (mean = 0, standard deviation = 1), we use the Z-score formula:
$$
x_{\text{standard}} = \frac{x - \mu}{\sigma}
$$
Where:
- $\mu$ is the mean of the data
- $\sigma$ is the population standard deviation (dividing by $n$, not $n - 1$), which is what `StandardScaler` computes

Applying the formula:
- Mean $(\mu)$: $(150 + 160 + 170 + 180) / 4 = 165$
- Standard deviation $(\sigma)$: $\sqrt{\frac{(150 - 165)^2 + (160 - 165)^2 + (170 - 165)^2 + (180 - 165)^2}{4}} = 11.18$

- For 150: $(150 - 165) / 11.18 = -1.341$
- For 160: $(160 - 165) / 11.18 = -0.447$
- For 170: $(170 - 165) / 11.18 = 0.447$
- For 180: $(180 - 165) / 11.18 = 1.341$


**Use Case**:
- Useful for **linear regression**, **PCA**, and **logistic regression**.


In [8]:
# Create DataFrame
df = pd.DataFrame({'Height(cm)': [150, 160, 170, 180]})

# Apply Standardization
scaler = StandardScaler()
df['Height(cm)_standard'] = scaler.fit_transform(df[['Height(cm)']])
print(df)


   Height(cm)  Height(cm)_standard
0         150            -1.341641
1         160            -0.447214
2         170             0.447214
3         180             1.341641
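
The hand calculation above divides by $n$, which matches `StandardScaler`. pandas' `Series.std()` divides by $n - 1$ by default, so the two can disagree on small samples. A short sketch of the difference, using names introduced here for illustration:

```python
import numpy as np
import pandas as pd

heights = pd.Series([150, 160, 170, 180])

# Population standard deviation (ddof=0), as in the worked example
population_std = heights.std(ddof=0)

# pandas default is the sample standard deviation (ddof=1)
sample_std = heights.std()

# Z-scores computed with the population standard deviation
z_population = (heights - heights.mean()) / population_std

print(round(population_std, 6), round(sample_std, 6))
print(z_population.round(6).tolist())
```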


#### Log Transformation
**Example**:
Consider salary data with large variations: `[30,000, 50,000, 100,000, 200,000]`.

A log transformation (the natural logarithm here, as computed by `np.log`) helps compress the range:
$$
x_{\text{log}} = \log(x)
$$

Applying the formula:
- For 30,000: $\log(30,000) = 10.308953$
- For 50,000: $\log(50,000) = 10.819778$
- For 100,000: $\log(100,000) = 11.512925$
- For 200,000: $\log(200,000) = 12.206073$


**Use Case**:
- This technique is useful for handling **skewed distributions** or datasets with large variations, often used in **linear regression**.

In [9]:
# Create DataFrame
df = pd.DataFrame({'Salary($)': [30000, 50000, 100000, 200000]})

# Apply Log Transformation
df['Salary($)_log'] = np.log(df['Salary($)'])
print(df)


   Salary($)  Salary($)_log
0      30000      10.308953
1      50000      10.819778
2     100000      11.512925
3     200000      12.206073
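
A plain logarithm is undefined at zero, and a ratio feature such as salary can legitimately be 0. One common workaround is `np.log1p`, which computes $\log(1 + x)$, with `np.expm1` as its inverse. The sketch below adds a made-up zero salary to the data to show this.

```python
import numpy as np
import pandas as pd

# Same salaries as above, plus a hypothetical zero value
df_salary = pd.DataFrame({'Salary($)': [0, 30000, 50000, 100000, 200000]})

# log1p(x) = log(1 + x), so a salary of 0 maps to 0 instead of -inf
df_salary['Salary($)_log1p'] = np.log1p(df_salary['Salary($)'])

# expm1 undoes the transformation
recovered = np.expm1(df_salary['Salary($)_log1p'])

print(df_salary)
print(np.allclose(recovered, df_salary['Salary($)']))
```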
