# Big Data Lecture Notes: Classification

## Part 1: Classification Techniques

### What is Classification? 🤔

Classification is a predictive data modeling task where new instances are assigned to predefined classes or categories. The process involves two main steps:

1.  **Training**: A classifier model is created using a training dataset that includes features and their known class labels.
2.  **Testing**: The trained model predicts the class of new, unseen records from a testing dataset, and the predictions are compared with the known labels to measure how well the model generalizes.
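
The two steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the toy dataset, the `contains_link` feature, and the one-rule "classifier" (`train`, `predict`) are hypothetical and are not taken from the lecture.

```python
from collections import Counter, defaultdict

def train(rows):
    """Training: learn the most common label for each feature value."""
    by_value = defaultdict(Counter)
    for feature, label in rows:
        by_value[feature][label] += 1
    # Fallback label for feature values never seen during training.
    default = Counter(label for _, label in rows).most_common(1)[0][0]
    rules = {value: counts.most_common(1)[0][0]
             for value, counts in by_value.items()}
    return rules, default

def predict(model, feature):
    """Testing: apply the learned rules to one unseen record."""
    rules, default = model
    return rules.get(feature, default)

# Hypothetical records: (contains_link, label)
training_set = [(True, "spam"), (True, "spam"), (True, "not spam"),
                (False, "not spam"), (False, "not spam")]
testing_set = [(True, "spam"), (False, "not spam"), (False, "spam")]

model = train(training_set)
correct = sum(predict(model, x) == y for x, y in testing_set)
accuracy = correct / len(testing_set)
print(f"accuracy = {accuracy:.2f}")
```

The testing set is kept separate from the training set so that the accuracy reflects performance on data the model has not seen.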



Classifiers can be categorized based on the number of classes they predict:
* **Binary Classifiers**: These models predict one of only two possible outcomes. For example, an email is either "spam" or "not spam".
* **Multi-Class Classifiers**: These models predict one of more than two distinct classes. For example, classifying different types of animals.

Common classification algorithms include **Decision Trees**, **Random Forest**, and Logistic Regression.

---

### Decision Trees (DTs) 🌳

A Decision Tree (DT) is a tree-like predictive model that helps in decision-making. It works by splitting a dataset into smaller, more homogeneous sets based on the most significant features. "Homogeneous" means that the samples in a set mostly belong to the same class.

#### Structure of a Decision Tree
* **Root Node**: The top-most node that represents the entire dataset before any splits.
* **Internal Node (or Decision Node)**: Represents a "test" on a specific attribute (e.g., is gender male or female?).
* **Branch**: The outcome of a test, connecting one node to another.
* **Leaf Node (or Terminal Node)**: A node that doesn't split further and assigns a final class label (e.g., "survived" or "died").
* **Pruning**: The process of removing sub-nodes from a decision node to prevent the model from becoming too complex. This is the opposite of splitting.



#### The ID3 Algorithm
The **ID3 (Iterative Dichotomiser 3)** algorithm is a common method for building decision trees that uses two key concepts to decide how to split the data at each node: **Entropy** and **Information Gain**.

##### Entropy
Entropy is a measure of the **uncertainty or randomness** in a set of data.
* **High entropy** means the data is mixed with many different classes (high uncertainty).
* **Low entropy** means the data is mostly one class (low uncertainty).
* **Zero entropy** means the data is perfectly pure, with all samples belonging to a single class.

The formula for entropy is:
$H(S) = \sum_{i=1}^{n} -p_i \log_2(p_i)$
* **Note**: This formula calculates the randomness in a dataset S.
* $S$: The dataset.
* $n$: The number of different classes.
* $p_i$: The probability of class $i$ (i.e., the proportion of the data belonging to that class).
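
As a minimal sketch, the entropy formula can be written directly in Python. The function name `entropy` and the example labels are assumptions made for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) in bits, where p_i is the proportion of each class in labels."""
    total = len(labels)
    counts = Counter(labels)
    return sum(-(c / total) * log2(c / total) for c in counts.values())

mixed = ["yes", "yes", "no", "no"]   # two classes, evenly mixed
pure = ["yes", "yes", "yes", "yes"]  # a single class

print(entropy(mixed))  # 1.0
print(entropy(pure))   # 0.0
```

An even two-class mix gives the maximum of 1 bit, while a pure set gives zero, matching the high, low, and zero entropy cases described above.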

##### Information Gain (IG)
Information Gain measures how much the entropy is reduced by splitting the data on a particular attribute. The ID3 algorithm greedily selects the attribute with the **highest information gain** to be the next splitting node.

The formula for Information Gain is:
$IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)$
* **Note**: This formula calculates the reduction in uncertainty after splitting the dataset S on attribute A.
* $IG(S, A)$: The information gain from splitting dataset S by attribute A.
* $H(S)$: The entropy of the entire dataset before the split.
* $Values(A)$: The set of all possible values for attribute A.
* $\frac{|S_v|}{|S|}$: The proportion of samples in S that have the value $v$ for attribute A (this is the weight).
* $H(S_v)$: The entropy of the subset of data where attribute A has value $v$.
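
The formula can be sketched in Python as follows. The row format (a list of dictionaries), the function names, and the toy weather data are assumptions made for illustration only.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """H(S) in bits for a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * log2(c / total)
               for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """IG(S, A) = H(S) minus the weighted entropy of each subset S_v."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[attribute]].append(row[target])
    weighted = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy([row[target] for row in rows]) - weighted

# Hypothetical toy data
rows = [
    {"outlook": "sunny", "windy": "no",  "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no",  "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]

print(information_gain(rows, "outlook", "play"))  # 1.0
print(information_gain(rows, "windy", "play"))    # 0.0
```

Here `outlook` separates the classes perfectly, so it removes all of the uncertainty, while `windy` leaves every subset as mixed as the original data.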

The ID3 algorithm repeatedly calculates the IG for all attributes and splits the data using the attribute that provides the most information, continuing until all leaf nodes are pure or another stopping condition is met.
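
A compact recursive sketch of this loop is shown below. It is a simplified illustration, not a full ID3 implementation: it handles only categorical attributes, does no pruning, and uses hypothetical names (`id3`, `classify`) and toy data.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    total = len(labels)
    return sum(-(c / total) * log2(c / total)
               for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[attribute]].append(row[target])
    weighted = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy([row[target] for row in rows]) - weighted

def id3(rows, attributes, target):
    """Return a nested dict {attribute: {value: subtree}} or a class label."""
    labels = [row[target] for row in rows]
    # Stopping conditions: the node is pure, or no attributes remain.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: split on the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[best]].append(row)
    return {best: {value: id3(subset, remaining, target)
                   for value, subset in partitions.items()}}

def classify(tree, row):
    """Follow branches until a leaf label is reached (None if value unseen)."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches.get(row[attribute])
    return tree

# Hypothetical toy data
rows = [
    {"outlook": "sunny",    "windy": "no",  "play": "yes"},
    {"outlook": "sunny",    "windy": "yes", "play": "no"},
    {"outlook": "rainy",    "windy": "no",  "play": "no"},
    {"outlook": "rainy",    "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no",  "play": "yes"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
]

tree = id3(rows, ["outlook", "windy"], "play")
print(tree)
print(classify(tree, {"outlook": "sunny", "windy": "yes"}))  # no
```

On this data the root splits on `outlook`. Only the mixed `sunny` branch needs a second split on `windy`, because the other two branches are already pure.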

#### Advantages and Disadvantages of Decision Trees

* **Advantages**:
    * Easy to understand and visualize.
    * Can be used to generate clear decision rules.
    * Requires very little hyper-parameter tuning.
* **Disadvantages**:
    * Can easily **overfit** the training data, leading to poor performance on new data.
    * Can have low prediction accuracy compared to other algorithms.
    * Calculations can become complex if there are many class labels.

* **Question**: What is the potential problem if a DT is built to its maximum depth on training data?
    * **Answer**: The tree will likely **overfit** the data. It will learn the noise and specific details of the training set so perfectly that it fails to generalize to new, unseen data. **Pruning** is used to combat this.

---

### Random Forests 🌲🌲🌲

A Random Forest is an **ensemble classifier**, which means it combines multiple models to improve performance. Specifically, it consists of a large number of individual decision trees. A single decision tree tends to overfit, but combining many of them can lead to a much more robust model.

The final prediction of a Random Forest is the class that gets the most votes (the **mode**) from all the individual trees in the forest.
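
The voting step can be sketched in a few lines of Python. The function name `majority_vote` and the list of per-tree predictions are hypothetical.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent class among the individual tree predictions."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical outputs from five trees for one record
tree_predictions = ["survived", "died", "survived", "survived", "died"]
forest_prediction = majority_vote(tree_predictions)
print(forest_prediction)  # survived
```

When two classes tie, `Counter.most_common` returns the one that appeared first, so a real system may want an explicit tie-breaking rule.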

Two ensemble techniques are relevant here:
1.  **Bagging (Bootstrap Aggregating)**: Rather than training every tree on the exact same dataset, each tree is trained on a different bootstrap sample, i.e. a random sample of the training data drawn with replacement. This helps to reduce the variance of the model.
2.  **Gradient Boosting**: This is a separate ensemble technique rather than a component of Random Forest. It builds trees one at a time, where each new tree is trained to correct the errors of the ensemble built so far. It combines several "weak learners" to produce an overall strong model.
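
The sampling step of bagging can be sketched as follows, assuming the standard bootstrap, which draws with replacement. The helper name `bootstrap_sample`, the seed, and the integer stand-ins for records are hypothetical.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) records with replacement, so some records repeat."""
    return rng.choices(rows, k=len(rows))

rng = random.Random(0)   # fixed seed so the run is repeatable
rows = list(range(10))   # stand-ins for ten training records

# One bootstrap sample per tree in a (very small) forest of three trees.
samples = [bootstrap_sample(rows, rng) for _ in range(3)]
for i, sample in enumerate(samples):
    print(f"tree {i}: {sorted(sample)}")
```

Each tree would then be trained on its own sample, so the trees differ from one another even though they all come from the same training set.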

#### Advantages and Disadvantages of Random Forest

* **Advantages**:
    * Highly accurate and robust to correlated predictors.
    * Can handle thousands of input variables and can even be used for feature selection.
    * Some implementations can handle missing data internally.
* **Disadvantages**:
    * The model is difficult to interpret; it's a "black box" compared to a single decision tree.
    * It can be slow to train if there are a large number of trees.
    * It may give unreliable predictions for data that is outside the range seen in the training data.

## Part 2: Parallel Classification

### How can we parallelize the training of ML models?

Training a single, large decision tree can be slow. We can speed this up using parallel processing.

#### Data Parallelism
In this approach, the dataset is partitioned, and multiple processors work on it simultaneously.

* **Vertical Partitioning**: The dataset's columns (features) are split across different processors. Each processor gets a subset of the features, but the record ID and target class are replicated on every processor.
* **Horizontal Partitioning ('Intra-tree node parallelism')**: The dataset's rows (records) are split across processors, and each processor holds every attribute for its own subset of rows. All processors work together to build a **single tree**. At each node, every processor computes class counts per attribute value over its local rows. The processors then combine these counts in a "global information sharing stage", from which the Information Gain of each attribute is calculated and the best attribute for the split is chosen. This process is repeated for every node in the tree.
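
The horizontal case can be sketched by simulating two processors in a single Python process. Everything here is an illustrative assumption: the partitions are plain list slices, the "global information sharing stage" is a sum of `Counter` objects, and the function names and toy data are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2

def local_counts(partition, attributes, target):
    """Counts of (attribute, value, class) over one processor's rows."""
    counts = Counter()
    for row in partition:
        for attribute in attributes:
            counts[(attribute, row[attribute], row[target])] += 1
    return counts

def entropy_from_counts(class_counts):
    total = sum(class_counts.values())
    return sum(-(c / total) * log2(c / total) for c in class_counts.values())

def gain_from_counts(counts, attribute):
    """Information gain computed from merged counts, not from raw rows."""
    by_value, overall = defaultdict(Counter), Counter()
    for (name, value, label), c in counts.items():
        if name == attribute:
            by_value[value][label] += c
            overall[label] += c
    total = sum(overall.values())
    weighted = sum(sum(cc.values()) / total * entropy_from_counts(cc)
                   for cc in by_value.values())
    return entropy_from_counts(overall) - weighted

# Hypothetical toy data
rows = [
    {"outlook": "sunny",    "windy": "no",  "play": "yes"},
    {"outlook": "sunny",    "windy": "yes", "play": "no"},
    {"outlook": "rainy",    "windy": "no",  "play": "no"},
    {"outlook": "rainy",    "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no",  "play": "yes"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
]
attributes = ["outlook", "windy"]

partitions = [rows[0::2], rows[1::2]]  # rows split across two "processors"
merged = sum((local_counts(p, attributes, "play") for p in partitions),
             Counter())                # simulated global sharing stage
best = max(attributes, key=lambda a: gain_from_counts(merged, a))
print(best)  # outlook
```

Because counts are additive, the merged counts equal those computed over the whole dataset, so the chosen split is the same one a single processor would pick.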



#### Result Parallelism ('Inter-tree node parallelism')
This approach also uses horizontal partitioning (splitting rows), but the goal is for processors to work on **different parts of the tree independently**.

* **Level 1 (Root Node)**: At the very beginning, all processors must work together using **data parallelism**. This is because calculating the first split requires count information from the *entire* dataset.
* **Level 2+ (Subsequent Nodes)**: After the root node is split, the tree branches out. At this point, different processors can take ownership of different branches and work on them concurrently without needing to constantly share information. For example, Processor 1 can build the "Time=Midday" sub-tree while Processor 2 builds the "Time=Sunset" sub-tree. This greatly improves efficiency as the tree grows deeper.
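
The Level 2+ idea can be sketched with a thread pool, where each worker takes one branch of the root split. This shows only the structure: `build_subtree` is a hypothetical placeholder that returns the majority class instead of growing a real sub-tree, the `jog` target and the data are invented, and threads in CPython would not give a real speed-up for this kind of work (a real system would use separate processes or cluster nodes).

```python
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor

def build_subtree(rows, target):
    """Placeholder for a full tree builder: majority class of the branch."""
    return Counter(row[target] for row in rows).most_common(1)[0][0]

def build_branches_concurrently(rows, root_attribute, target, workers=2):
    """Split on the (already chosen) root attribute, then build each branch
    independently, one task per branch."""
    branches = defaultdict(list)
    for row in rows:
        branches[row[root_attribute]].append(row)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {value: pool.submit(build_subtree, subset, target)
                   for value, subset in branches.items()}
        return {root_attribute: {value: future.result()
                                 for value, future in futures.items()}}

# Hypothetical toy data
rows = [
    {"time": "Midday", "jog": "no"},
    {"time": "Midday", "jog": "no"},
    {"time": "Sunset", "jog": "yes"},
    {"time": "Sunset", "jog": "yes"},
    {"time": "Sunset", "jog": "no"},
]

tree = build_branches_concurrently(rows, "time", "jog")
print(tree)  # {'time': {'Midday': 'no', 'Sunset': 'yes'}}
```

Each task needs only the rows of its own branch, which is why no information sharing is required once the root split has been decided.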