# Understanding Data Types for Machine Learning

In machine learning, understanding and handling different data types is crucial for effective feature engineering, model selection, and evaluation. This notebook explores the main categories of data types with practical examples using a realistic employee dataset, focusing on their impact on ML workflows.

## Data Type Classification in ML

Data used in machine learning can be broadly classified into two main categories:

1. **Quantitative Data** (Numerical): Data that can be measured and expressed numerically. These features are often used directly in ML models or transformed for better performance.
   - **Discrete**: Countable values (e.g., number of employees, number of projects)
   - **Continuous**: Measurable values that can take any value within a range (e.g., salary, height, temperature)

2. **Qualitative Data** (Categorical): Data that represents categories or groups. These features often require encoding before being used in ML algorithms.
   - **Nominal**: Categories without inherent order (e.g., gender, department, color)
   - **Ordinal**: Categories with a natural order (e.g., education level, performance rating)

Let's explore these concepts with a practical example relevant to machine learning.

## Creating a Sample Employee Dataset for ML

Let's create a realistic employee dataset that demonstrates different data types. This dataset will serve as a foundation for feature engineering and model building in machine learning tasks, such as predicting employee performance or salary.

In [1]:
import pandas as pd

# Create a realistic employee dataset
employee_data = {
    'employee_id': [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010],
    'name': ['Alice Johnson', 'Bob Smith', 'Carol Davis', 'David Wilson', 'Eva Brown', 
             'Frank Miller', 'Grace Chen', 'Henry Taylor', 'Iris Garcia', 'Jack Anderson'],
    'department': ['Engineering', 'Marketing', 'Engineering', 'Sales', 'HR', 
                   'Engineering', 'Marketing', 'Sales', 'Finance', 'Engineering'],
    'years_experience': [5, 8, 3, 12, 7, 2, 6, 15, 4, 9],
    'salary': [75000.50, 82000.75, 68000.00, 95000.25, 71000.80, 
               62000.00, 78000.30, 105000.00, 73000.60, 87000.90],
    'education_level': ['Bachelor', 'Master', 'Bachelor', 'Master', 'Bachelor', 
                        'Bachelor', 'PhD', 'Master', 'Bachelor', 'Master'],
    'performance_rating': ['Good', 'Excellent', 'Average', 'Excellent', 'Good', 
                          'Average', 'Excellent', 'Excellent', 'Good', 'Good']
}

# Create DataFrame
df = pd.DataFrame(employee_data)

# Display the dataset
print("Employee Dataset:")
print(df)
print(f"\nDataset Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")

Employee Dataset:
   employee_id           name   department  years_experience     salary  \
0         1001  Alice Johnson  Engineering                 5   75000.50   
1         1002      Bob Smith    Marketing                 8   82000.75   
2         1003    Carol Davis  Engineering                 3   68000.00   
3         1004   David Wilson        Sales                12   95000.25   
4         1005      Eva Brown           HR                 7   71000.80   
5         1006   Frank Miller  Engineering                 2   62000.00   
6         1007     Grace Chen    Marketing                 6   78000.30   
7         1008   Henry Taylor        Sales                15  105000.00   
8         1009    Iris Garcia      Finance                 4   73000.60   
9         1010  Jack Anderson  Engineering                 9   87000.90   

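  education_level performance_rating  
0        Bachelor               Good  
1          Master          Excellent  
2        Bachelor            Average  
3          Master          Excellent  
4        Bachelor               Good  
5        Bachelor            Average  
6             PhD          Excellent  
7          Master          Excellent  
8        Bachelor               Good  
9          Master               Good  

Dataset Shape: (10, 7)
Columns: ['employee_id', 'name', 'department', 'years_experience', 'salary', 'education_level', 'performance_rating']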


## Data Type Analysis for ML Feature Engineering

Let's examine the data types present in our dataset and understand how pandas interprets them by default. Recognizing these types is essential for preprocessing, encoding, and selecting appropriate ML algorithms.

In [2]:
# Display data types
print("Data Types in our Dataset:")
print(df.dtypes)

print("\n" + "="*50)
print("\nDescriptive Statistics for Numerical Columns:")
print(df.describe())

Data Types in our Dataset:
employee_id             int64
name                   object
department             object
years_experience        int64
salary                float64
education_level        object
performance_rating     object
dtype: object

==================================================

Descriptive Statistics for Numerical Columns:
       employee_id  years_experience         salary
count     10.00000         10.000000      10.000000
mean    1005.50000          7.100000   79600.410000
std        3.02765          4.067486   13031.574396
min     1001.00000          2.000000   62000.000000
25%     1003.25000          4.250000   71500.750000
50%     1005.50000          6.500000   76500.400000
75%     1007.75000          8.750000   85750.862500
max     1010.00000         15.000000  105000.000000


### Extracting Columns by Data Type for Feature Engineering

In machine learning, it's important to identify which columns are discrete, continuous, or categorical, because each type calls for different preprocessing and feature engineering. The following cell extracts columns by their pandas dtype. Treat the result as a first approximation: a dtype describes how values are stored, not what they mean, so an integer column such as `employee_id` can still be an identifier rather than a count.

In [3]:
# Extracting Data Types for Feature Engineering
print("Extracting columns by data type:")

# Discrete (int) columns
discrete_cols = df.select_dtypes(include=['int64', 'int32']).columns.tolist()
print(f"Discrete (int) columns: {discrete_cols}")

# Continuous (float) columns
continuous_cols = df.select_dtypes(include=['float64', 'float32']).columns.tolist()
print(f"Continuous (float) columns: {continuous_cols}")

# Categorical (string/object) columns
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
print(f"Categorical (string/object) columns: {categorical_cols}")

Extracting columns by data type:
Discrete (int) columns: ['employee_id', 'years_experience']
Continuous (float) columns: ['salary']
Categorical (string/object) columns: ['name', 'department', 'education_level', 'performance_rating']
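
Because the dtype only reflects storage, one option is to declare the intended meaning of each column explicitly before selecting by type. The following is a minimal sketch on a three-row slice of the dataset; `df_demo` and `education_dtype` are names introduced here for illustration, and treating `employee_id` as text is an assumption about how it will be used.

```python
import pandas as pd

# A small slice of the employee data, with values copied from the dataset above.
df_demo = pd.DataFrame({
    "employee_id": [1001, 1002, 1003],
    "department": ["Engineering", "Marketing", "Engineering"],
    "years_experience": [5, 8, 3],
    "education_level": ["Bachelor", "Master", "Bachelor"],
})

# employee_id is stored as an integer but acts as a label, so treat it as text.
df_demo["employee_id"] = df_demo["employee_id"].astype("string")

# Nominal column: unordered category dtype.
df_demo["department"] = df_demo["department"].astype("category")

# Ordinal column: ordered category dtype with an explicit level order.
education_dtype = pd.CategoricalDtype(["Bachelor", "Master", "PhD"], ordered=True)
df_demo["education_level"] = df_demo["education_level"].astype(education_dtype)

# Selecting by dtype now matches the intended meaning of each column.
numeric_cols = df_demo.select_dtypes(include="number").columns.tolist()
category_cols = df_demo.select_dtypes(include="category").columns.tolist()
print(f"Numeric columns: {numeric_cols}")
print(f"Categorical columns: {category_cols}")
```

After these casts, `employee_id` no longer appears among the numeric columns, and the ordered dtype records that Bachelor < Master < PhD.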


## 1. Quantitative Data (Numerical) in ML

Quantitative data represents measurable quantities and is often used directly as features in machine learning models. Proper handling, such as scaling and normalization, can improve model performance.

### 1.1 Discrete Quantitative Data in ML

Discrete data consists of countable values, typically integers representing counts or whole numbers. In ML, these features may be used as-is or transformed depending on the algorithm.

**Examples from our dataset:**
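- **`years_experience`**: Number of years worked (2, 3, 4, 5, 6, 7, 8, 9, 12, 15)
- **`employee_id`** (counter-example): Stored as an integer (1001, 1002, 1003, ...), but it is an identifier rather than a count. Arithmetic on it is meaningless, so treat it as a nominal label despite its `int64` dtype.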

**ML Considerations:**
- Tree-based models can use these features without scaling
- Linear and distance-based models often benefit from scaling or normalization
- Often result from counting processes
- Gaps between values are meaningful and consistent

In [4]:
# Analyzing Discrete Data
print("\nYears of Experience Statistics:")
print(f"\tRange: {df['years_experience'].min()} to {df['years_experience'].max()} years")
print(f"\tAverage: {df['years_experience'].mean():.1f} years")


Years of Experience Statistics:
	Range: 2 to 15 years
	Average: 7.1 years
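
One possible transformation for a discrete feature is to bin it into ordered bands. The sketch below is illustrative only: the bin edges and the band names (`junior`, `mid`, `senior`) are assumptions chosen for this example and are not part of the dataset.

```python
import pandas as pd

# Values copied from the years_experience column above.
years_experience = pd.Series([5, 8, 3, 12, 7, 2, 6, 15, 4, 9], name="years_experience")

# Hypothetical bin edges and labels, chosen only for illustration.
# pd.cut uses right-closed intervals by default: (0, 3], (3, 7], (7, 20].
experience_band = pd.cut(
    years_experience,
    bins=[0, 3, 7, 20],
    labels=["junior", "mid", "senior"],
)

# Count how many employees fall into each band, in band order.
print(experience_band.value_counts().sort_index())
```

Binning trades precision for robustness, so whether it helps depends on the model and the amount of data.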


### 1.2 Continuous Quantitative Data in ML

Continuous data can take any value within a given range and is commonly used as input features for ML models. Proper scaling and transformation are often necessary for optimal model performance.

**Examples from our dataset:**
- **`salary`**: Employee salaries (62000.00, 68000.00, 71000.80, 73000.60, ...)

**ML Considerations:**
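- Scaling (e.g., StandardScaler, MinMaxScaler) helps algorithms that are sensitive to feature magnitude, such as k-nearest neighbors, SVMs, and models trained by gradient descent; tree-based models generally do not need it
- Can serve as an input feature for regression or classification, or as the target of a regression task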
- Theoretically infinite number of possible values between any two points
- Can be meaningfully divided into smaller units

In [5]:
# Analyzing Continuous Data
print("Continuous Data Analysis:")
print("\nSalary Statistics:")
print(f"\tMinimum Salary: ${df['salary'].min():,.2f}")
print(f"\tMaximum Salary: ${df['salary'].max():,.2f}")
print(f"\tAverage Salary: ${df['salary'].mean():,.2f}")
print(f"\tMedian Salary: ${df['salary'].median():,.2f}")
print(f"\tSalary Range: ${df['salary'].max() - df['salary'].min():,.2f}")

Continuous Data Analysis:

Salary Statistics:
	Minimum Salary: $62,000.00
	Maximum Salary: $105,000.00
	Average Salary: $79,600.41
	Median Salary: $76,500.40
	Salary Range: $43,000.00
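
To make the scaling step concrete, here is a minimal sketch of standardization and min-max scaling written with plain pandas arithmetic. It mirrors what scikit-learn's `StandardScaler` and `MinMaxScaler` compute with their default settings; the variable names are introduced here for illustration.

```python
import pandas as pd

# Values copied from the salary column above.
salary = pd.Series(
    [75000.50, 82000.75, 68000.00, 95000.25, 71000.80,
     62000.00, 78000.30, 105000.00, 73000.60, 87000.90],
    name="salary",
)

# Standardization (z-score): subtract the mean, divide by the standard deviation.
# ddof=0 gives the population standard deviation, which StandardScaler also uses.
salary_standardized = (salary - salary.mean()) / salary.std(ddof=0)

# Min-max scaling: map the smallest value to 0 and the largest to 1.
salary_minmax = (salary - salary.min()) / (salary.max() - salary.min())

print(pd.DataFrame({
    "salary": salary,
    "standardized": salary_standardized.round(3),
    "minmax": salary_minmax.round(3),
}))
```

In a real workflow the mean, standard deviation, minimum, and maximum should be computed on the training split only and then reused on validation and test data.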


## 2. Qualitative Data (Categorical) in ML

Qualitative data represents categories or groups and must be encoded as numbers before most machine learning models can use it. Choosing an encoding that matches the category type (e.g., one-hot encoding for nominal data, an ordered mapping for ordinal data) is essential for effective feature engineering.

### 2.1 Nominal Categorical Data in ML

Nominal data represents categories without any inherent order or ranking. These features are typically encoded using one-hot encoding for ML models.

**Examples from our dataset:**
- **`name`**: Employee names (Alice Johnson, Bob Smith, Carol Davis, ...)
- **`department`**: Department names (Engineering, Marketing, Sales, HR, Finance)

**ML Considerations:**
- One-hot encoding is commonly used
- Categories have no natural order
- Only equality comparisons make sense (equal or not equal)
- Mode is the only meaningful measure of central tendency
- Can be represented by numbers, but mathematical operations are meaningless

In [6]:
# Analyzing Nominal Categorical Data
print("Nominal Categorical Data Analysis:")

print("\nDepartment Distribution:")
dept_counts = df['department'].value_counts()
print(dept_counts)
print(f"Number of unique departments: {df['department'].nunique()}")

Nominal Categorical Data Analysis:

Department Distribution:
department
Engineering    4
Marketing      2
Sales          2
HR             1
Finance        1
Name: count, dtype: int64
Number of unique departments: 5
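
One-hot encoding can be sketched with `pd.get_dummies`, which creates one indicator column per category. This is one of several ways to do it, and the frame name `departments` is introduced here for illustration.

```python
import pandas as pd

# Values copied from the department column above.
departments = pd.DataFrame({
    "department": ["Engineering", "Marketing", "Engineering", "Sales", "HR",
                   "Engineering", "Marketing", "Sales", "Finance", "Engineering"],
})

# One indicator column per department; dtype=int yields 0/1 instead of booleans.
one_hot = pd.get_dummies(departments, columns=["department"], dtype=int)

print(one_hot.columns.tolist())
print(one_hot.head(3))
```

Each row contains exactly one 1, so no artificial order is introduced between departments. For linear models, passing `drop_first=True` is a common way to avoid perfectly collinear columns.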


### 2.2 Ordinal Categorical Data in ML

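Ordinal data represents categories with a natural order or ranking. In ML, these features are usually encoded as integers through an explicit mapping that preserves the order. Generic label encoders such as scikit-learn's `LabelEncoder` assign codes alphabetically, which would not keep an order such as Average < Good < Excellent.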

**Examples from our dataset:**
- **`education_level`**: Education levels (Bachelor < Master < PhD)
- **`performance_rating`**: Performance ratings (Average < Good < Excellent)

**ML Considerations:**
- An explicit ordinal mapping is preferred; alphabetical label encoding can scramble the order
- Categories have a natural, meaningful order
- Differences between categories may not be equal or measurable
- Median and mode are meaningful measures of central tendency
- Can use comparison operators (<, >, ≤, ≥) in addition to equality

In [7]:
# Analyzing Ordinal Categorical Data
print("Ordinal Categorical Data Analysis:")

# Show value counts for education_level and performance_rating
print("\nEducation Level Distribution:")
print(df['education_level'].value_counts())

print("\nPerformance Rating Distribution:")
print(df['performance_rating'].value_counts())

# Show number of unique categories
print(f"\nUnique Education Levels: {df['education_level'].nunique()}")
print(f"Unique Performance Ratings: {df['performance_rating'].nunique()}")

Ordinal Categorical Data Analysis:

Education Level Distribution:
education_level
Bachelor    5
Master      4
PhD         1
Name: count, dtype: int64

Performance Rating Distribution:
performance_rating
Good         4
Excellent    4
Average      2
Name: count, dtype: int64

Unique Education Levels: 3
Unique Performance Ratings: 3
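
The difference between an explicit ordinal mapping and alphabetical codes can be sketched with an ordered pandas categorical. The order Average < Good < Excellent comes from the text above; the variable names are introduced here for illustration.

```python
import pandas as pd

# Values copied from the performance_rating column above.
ratings = pd.Series(
    ["Good", "Excellent", "Average", "Excellent", "Good",
     "Average", "Excellent", "Excellent", "Good", "Good"],
    name="performance_rating",
)

# Explicit order: Average < Good < Excellent.
rating_dtype = pd.CategoricalDtype(["Average", "Good", "Excellent"], ordered=True)
ratings_ordered = ratings.astype(rating_dtype)

# Integer codes follow the declared order: Average=0, Good=1, Excellent=2.
rating_codes = ratings_ordered.cat.codes

# For comparison, an unordered categorical sorts its categories alphabetically:
# Average=0, Excellent=1, Good=2, which ranks Good above Excellent.
alphabetical_codes = ratings.astype("category").cat.codes

# Ordered categoricals also support comparison operators.
at_least_good = (ratings_ordered >= "Good").sum()

print(rating_codes.tolist())
print(alphabetical_codes.tolist())
print(f"Employees rated Good or better: {at_least_good}")
```

The same idea applies to `education_level` with the order Bachelor < Master < PhD, where the alphabetical order happens to coincide with the intended one.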


## Summary: Data Types for Machine Learning

Let's summarize the data types present in our employee dataset and their relevance for ML feature engineering:

| Column | Data Type Category | Sub-category | Pandas Type | ML Considerations |
|--------|-------------------|--------------|-------------|------------------|
| `employee_id` | Qualitative | Nominal (identifier) | int64 | Unique identifier stored as an integer; exclude from features |
| `name` | Qualitative | Nominal | object | Identifier, not used as a feature |
| `department` | Qualitative | Nominal | object | One-hot encoding for ML models |
| `years_experience` | Quantitative | Discrete | int64 | Used directly or normalized |
| `salary` | Quantitative | Continuous | float64 | Scale for magnitude-sensitive models |
| `education_level` | Qualitative | Ordinal | object | Ordinal encoding preserves order |
| `performance_rating` | Qualitative | Ordinal | object | Ordinal encoding preserves ranking |
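
As a rough sketch of how the table translates into a preprocessing step, the example below uses scikit-learn's `ColumnTransformer` to apply a different encoder to each group of columns. It assumes scikit-learn is installed; the step names (`numeric`, `nominal`, `ordinal`) and the frame name `df_features` are arbitrary labels introduced here, and treating `salary` as a feature rather than a target is an assumption made for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Feature columns copied from the dataset above; identifiers are left out.
df_features = pd.DataFrame({
    "department": ["Engineering", "Marketing", "Engineering", "Sales", "HR",
                   "Engineering", "Marketing", "Sales", "Finance", "Engineering"],
    "years_experience": [5, 8, 3, 12, 7, 2, 6, 15, 4, 9],
    "salary": [75000.50, 82000.75, 68000.00, 95000.25, 71000.80,
               62000.00, 78000.30, 105000.00, 73000.60, 87000.90],
    "education_level": ["Bachelor", "Master", "Bachelor", "Master", "Bachelor",
                        "Bachelor", "PhD", "Master", "Bachelor", "Master"],
    "performance_rating": ["Good", "Excellent", "Average", "Excellent", "Good",
                           "Average", "Excellent", "Excellent", "Good", "Good"],
})

preprocessor = ColumnTransformer(
    transformers=[
        # Quantitative columns: standardize to zero mean and unit variance.
        ("numeric", StandardScaler(), ["years_experience", "salary"]),
        # Nominal column: one indicator column per department.
        ("nominal", OneHotEncoder(handle_unknown="ignore"), ["department"]),
        # Ordinal columns: integer codes that follow the stated order.
        ("ordinal",
         OrdinalEncoder(categories=[["Bachelor", "Master", "PhD"],
                                    ["Average", "Good", "Excellent"]]),
         ["education_level", "performance_rating"]),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

X = preprocessor.fit_transform(df_features)
print(X.shape)
```

The result has 2 scaled columns, 5 one-hot columns, and 2 ordinal columns, in the order the transformers are listed.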


