# Machine Learning Fundamentals - Complete Study Notes

**Author:** ML Learning Series  
**Topic:** Foundation Concepts in Machine Learning  
**Level:** Intermediate Learner (Python + DSA basics assumed)

---

## Table of Contents
1. Data Types in Machine Learning
2. Types of Machine Learning
3. Batch (Offline) Learning
4. Online Learning
5. Learning Rate
6. Out-of-Core Learning
7. Instance-Based Learning
8. Model-Based Learning
9. Comparative Analysis: Batch vs Online
10. Comparative Analysis: Instance-Based vs Model-Based

---

## 1. Data Types in Machine Learning

### Simple Definition
Data is the foundation of any ML model. Different data types require different processing techniques and algorithms.


### Classification by Structure

#### 1.1 **Structured Data**
- **Definition:** Highly organized data with a predefined format, stored in tables or databases
- **Format:** Follows a fixed schema with clear relationships between data points
- **Examples:**
  - Customer records in SQL databases (name, age, email, purchase_history)
  - Financial transactions (transaction_id, amount, date, account_id)
  - Inventory systems (product_id, quantity, price, location)
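- **Typical Algorithms:** Statistical models, Decision Trees, Linear Regression
- **Characteristics:**
  - Easy to process with standard tools (SQL, pandas)
  - Requires less preprocessing than unstructured data
  - Fits in traditional relational databases
  - Models typically train quickly compared with deep learning on unstructured data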

#### 1.2 **Unstructured Data**
- **Definition:** Raw data without a predefined format or organization
- **Examples:**
  - Text documents and emails
  - Images and videos
  - Audio files
  - Social media posts
- **Typical Algorithms:** Deep Learning (CNNs for images, RNNs for text and other sequences, autoencoders for learning compact representations)
- **Characteristics:**
  - Requires significant preprocessing
  - Needs feature extraction
  - Large storage requirements
  - High computational cost
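
As a rough illustration of the feature-extraction step, the sketch below turns a few short texts into a bag-of-words count matrix using scikit-learn's `CountVectorizer`. The sample strings are made up for illustration, and bag-of-words is only one of several possible text representations.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical raw text samples (unstructured data)
documents = [
    "limited offer buy now",
    "project update attached",
    "buy one get one offer",
]

# Learn a vocabulary and count how often each word appears per document
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(documents)  # sparse matrix: documents x vocabulary

# Columns follow the alphabetically sorted vocabulary
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(features.toarray())
```

Each row is now a fixed-length numeric vector, which is the form most ML algorithms expect.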

#### 1.3 **Semi-Structured Data**
- **Definition:** Data with some organizational structure (tags, keys, or markers) but no rigid, table-like schema
- **Examples:**
  - JSON and XML files
  - Log files with timestamps
  - Email metadata
  - Web pages with HTML tags
- **Processing:** Hybrid approach: parse the structured parts (keys, tags, timestamps) into fields, then apply unstructured techniques to any remaining free-form content
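
A minimal sketch of the parsing step, assuming the data arrives as JSON: `pandas.json_normalize` flattens nested keys into table columns. The records and field names below are hypothetical.

```python
import json
import pandas as pd

# Hypothetical JSON records: nested fields, and not every record has every key
raw = '''
[
  {"user": {"id": 1, "name": "Ana"}, "event": "login", "ts": "2024-01-01T10:00:00"},
  {"user": {"id": 2, "name": "Ben"}, "event": "purchase", "amount": 19.99}
]
'''

records = json.loads(raw)

# Flatten nested keys into columns such as "user.id"; missing keys become NaN
table = pd.json_normalize(records)
print(table.columns.tolist())
print(table)
```

The result is a structured table, at the cost of missing values wherever a record lacked a key.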

### Classification by Representation

#### 1.4 **Numerical (Quantitative) Data**
- **Definition:** Data represented as numbers and measured on a scale

**Discrete Numerical Data:**
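- Countable; typically takes whole-number values only
- Examples: number of students (5, 10, 100), count of items, frequency
- Range: A countable set of values (it may be finite or unbounded, as with counts)
- Used in: Count features and count-prediction tasks (integer class labels are categorical, not discrete numerical)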

**Continuous Numerical Data:**
- Can take any value within a range
- Examples: temperature (25.3°C, 25.4°C), weight (65.2 kg), height, salary
- Range: Infinitely many possible values between any two points
- Used in: Regression targets (e.g., predicting a price) and real-valued input features
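
As a quick, heuristic check in pandas, integer dtypes often indicate discrete counts and float dtypes often indicate continuous measurements. This is only a rule of thumb (an integer column can also hold IDs or category codes), and the column names below are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "num_students": [5, 10, 100],          # discrete: counts
    "temperature_c": [25.3, 25.4, 26.1],   # continuous: measurements
})

# Map each column name to whether pandas stores it with an integer dtype
is_integer = {col: pd.api.types.is_integer_dtype(df[col]) for col in df.columns}
print(is_integer)
```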

#### 1.5 **Categorical (Qualitative) Data**
- **Definition:** Data that represents categories or labels, not measurable on a numerical scale

**Nominal Data (No Order):**
- Categories without inherent ranking
- Examples:
  - Gender: Male, Female, Non-binary
  - Color: Red, Blue, Green
  - Car brands: Toyota, Ford, BMW
  - Department: HR, Engineering, Sales
- Characteristic: Order doesn't matter
- Encoding: Usually one-hot encoding (creates one binary column per category)
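
A small sketch of one-hot encoding with scikit-learn's `OneHotEncoder`, using the color example above. The encoder returns a sparse matrix by default, so it is converted to a dense array here purely for display.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Nominal feature: one column, four samples (scikit-learn expects a 2D array)
colors = np.array([["Red"], ["Blue"], ["Green"], ["Red"]])

encoder = OneHotEncoder()                          # categories are sorted alphabetically
one_hot = encoder.fit_transform(colors).toarray()  # dense copy for display

print(encoder.categories_[0])  # column order: Blue, Green, Red
print(one_hot)
```

No column is numerically "larger" than another, so the encoding does not introduce an artificial order among the categories.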

**Ordinal Data (With Order):**
- Categories with meaningful ranking or order
- Examples:
  - Education level: High School < Bachelor's < Master's < Ph.D.
  - Customer satisfaction: Poor < Average < Good < Excellent
  - Clothing sizes: XS < S < M < L < XL
  - Movie ratings: 1 star < 2 stars < 3 stars < 4 stars < 5 stars
- Characteristic: Order has significance
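- Encoding: Ordinal (integer) encoding with an explicitly specified category order (e.g., 1, 2, 3, 4); note that scikit-learn's `LabelEncoder` assigns codes alphabetically, not by rank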
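
One way to keep the ranking when encoding is to state the order explicitly. The sketch below does this with pandas' ordered `Categorical`, using the clothing-size example; the sample values are made up.

```python
import pandas as pd

# Order listed from lowest to highest; this list is what carries the ranking
size_order = ["XS", "S", "M", "L", "XL"]

sizes = pd.Series(["M", "XS", "XL", "S", "M"])
ordered = pd.Categorical(sizes, categories=size_order, ordered=True)

codes = ordered.codes + 1   # shift the 0-based codes to 1..5
print(codes.tolist())       # [3, 1, 5, 2, 3]
```

The integer codes assume equal spacing between ranks, which may or may not be appropriate for a given model.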

### Classification by Labeling

#### 1.6 **Labeled Data**
- **Definition:** Data in which each example comes with a known output (target) label
- **Structure:** (Features, Label) pairs
- **Usage:** Supervised Learning
- **Example:**
  ```
  Email: "Buy now, limited offer"  →  Label: SPAM
  Email: "Project update attached"  →  Label: NOT_SPAM
  ```
- **Cost:** Labels are often expensive and time-consuming to create, since they usually require human annotation
- **Applications:** Email filtering, disease diagnosis, house price prediction
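
To make the (Features, Label) idea concrete, here is a toy sketch that fits a Naive Bayes spam classifier on four hand-written emails. The emails, the labels, and the model choice are illustrative assumptions; a real classifier would need far more data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical labeled dataset: each email is paired with a known label
emails = [
    "Buy now, limited offer",
    "Limited offer, buy today",
    "Project update attached",
    "Meeting notes attached",
]
labels = ["SPAM", "SPAM", "NOT_SPAM", "NOT_SPAM"]

# Turn text into word counts, then fit a classifier on (features, label) pairs
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)
model = MultinomialNB().fit(X, labels)

# Predict the label of an unseen email using the same vocabulary
new_email = ["Limited offer, buy now"]
prediction = model.predict(vectorizer.transform(new_email))
print(prediction[0])
```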

#### 1.7 **Unlabeled Data**
- **Definition:** Data with only input features, no target output
- **Structure:** Only features, no labels
- **Usage:** Unsupervised Learning
- **Example:**
  ```
  Customer: [Age: 28, Income: 50000, Purchases: [electronics, books, clothes]]
  → No label; the algorithm must discover patterns on its own
  ```
- **Availability:** Usually more abundant and cheaper to collect than labeled data
- **Applications:** Customer segmentation, anomaly detection, pattern discovery
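
As a sketch of pattern discovery without labels, the example below groups a handful of made-up customers with k-means. The feature values and the choice of two clusters are assumptions for illustration; in practice the features would normally be scaled first.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical unlabeled customers: [age, annual income in thousands]
customers = np.array([
    [22, 20], [25, 24], [27, 22],   # younger, lower income
    [52, 90], [55, 95], [58, 88],   # older, higher income
])

# Ask for two groups; only features are supplied, no labels
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
segments = kmeans.fit_predict(customers)
print(segments)  # cluster ids are arbitrary; only the grouping matters
```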

### Python Implementation - Data Types

```python
import pandas as pd
import numpy as np

# Structured Data Example
structured_data = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'age': [25, 35, 45, 28],           # Continuous numerical
    'gender': ['M', 'F', 'M', 'F'],    # Nominal categorical
    'education': ['BS', 'MS', 'PhD', 'BS'],  # Ordinal categorical
    'salary': [50000, 75000, 95000, 60000],  # Continuous numerical
    'purchased': ['Yes', 'No', 'Yes', 'Yes']  # Target label (supervised)
})

# Numerical data - Discrete
discrete_numerical = np.array([5, 10, 15, 20, 25])  # Countable integers
print(f"Discrete: {discrete_numerical}")

# Numerical data - Continuous
continuous_numerical = np.array([5.5, 10.3, 15.7, 20.1, 25.9])  # Any decimal
print(f"Continuous: {continuous_numerical}")

# Categorical data - Nominal (one-hot encoded here with pandas' get_dummies)
categorical_nominal = pd.DataFrame({'color': ['red', 'blue', 'green', 'red']})
encoded = pd.get_dummies(categorical_nominal)
print("One-hot encoded:\n", encoded)

# Categorical data - Ordinal (encode with an explicit order; LabelEncoder
# would sort alphabetically and rank BS below HS)
from sklearn.preprocessing import OrdinalEncoder
education_levels = ['HS', 'BS', 'MS', 'PhD']  # lowest to highest
education_values = pd.DataFrame({
    'education': ['BS', 'MS', 'PhD', 'HS', 'BS']
})
oe = OrdinalEncoder(categories=[education_levels])
education_encoded = oe.fit_transform(education_values[['education']]).ravel().astype(int)
print(f"Ordinal encoded: {education_encoded}")  # [1 2 3 0 1]
```

---