# Big Data Lecture Notes

## Part 1: Introduction to Machine Learning

### What is Machine Learning (ML)? 🤔

Machine learning gives computers the ability to learn without being explicitly programmed. It's used in many applications you see daily. For example, a McKinsey study found that ML-driven recommendations are responsible for **35% of Amazon purchases** and **75% of what people watch on Netflix**.

A formal definition by Tom Mitchell (1997) is: "A computer program is said to learn from **experience E** with respect to some class of **tasks T** and **performance measure P**, if its performance at tasks in T, as measured by P, improves with experience E".

Let's break that down with an example like **face recognition**:
* **Task (T)**: Given a new photo, recognize the name of the person.
* **Experience (E)**: A database of thousands of known faces.
* **Performance (P)**: How accurate the recognition is.

***

### Core Elements of Machine Learning

ML systems are built on three key elements: Data, Model, and Assessment.

#### 1. Data 📊
Data is the foundation of ML.
* It consists of **features** (input variables, denoted as $x \in \mathbb{R}^{d}$) and sometimes **labels** (output variables, denoted as $y \in Y$).
* A collection of these data points forms a **dataset** of feature-label pairs: $\mathcal{D}=\{(x_{i},y_{i})\}_{i=1}^{n}$.
* Before feeding data to a model, it must be processed through steps like feature extraction, selection, transformation, and normalization.
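As a minimal sketch of the last point, assume min-max scaling is the normalization step (the notes do not say which method is used). A dataset can then be held as a list of (features, label) pairs, with each feature column rescaled to [0, 1]. The sample values and the helper name `min_max_normalize` are made up for illustration.

```python
from typing import List, Tuple

# A tiny labeled dataset D = {(x_i, y_i)}: each x_i has d = 2 features.
# The numbers are invented purely for illustration.
dataset: List[Tuple[List[float], int]] = [
    ([150.0, 50.0], 0),
    ([160.0, 60.0], 0),
    ([180.0, 80.0], 1),
    ([190.0, 90.0], 1),
]


def min_max_normalize(rows: List[List[float]]) -> List[List[float]]:
    """Rescale every feature column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        scaled = []
        for value, low, high in zip(row, lows, highs):
            span = high - low
            # A constant column carries no information; map it to 0.0.
            scaled.append((value - low) / span if span else 0.0)
        normalized.append(scaled)
    return normalized


features = [x for x, _ in dataset]
normalized_features = min_max_normalize(features)
```

After scaling, both features live on the same range, so neither dominates a distance or gradient computation just because of its units.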

#### 2. Model 🧠
The model is the algorithm that learns patterns from the data.
* In supervised learning, the goal is to find a function $f_{\theta}:X\rightarrow Y$ that maps data from the data space (X) to the label space (Y).
* The **predictive model** is represented as $Y=f_{\theta}(X)$.
* **Model Learning (Training)** is the process of finding the optimal model parameters ($\theta$) by minimizing the error between the model's predictions and the true labels in the training data, based on a loss function.
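One common way to carry out this minimization is gradient descent. The sketch below assumes a one-feature linear model $f_{\theta}(x) = w x + b$ and mean squared error as the loss; both are illustrative choices, not something the notes prescribe, and `train_linear_model` is a hypothetical helper name.

```python
from typing import List, Tuple


def train_linear_model(
    xs: List[float],
    ys: List[float],
    learning_rate: float = 0.05,
    epochs: int = 2000,
) -> Tuple[float, float]:
    """Fit f(x) = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction errors under the current parameters theta = (w, b).
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the loss (1/n) * sum(error^2) with respect to w and b.
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Step against the gradient to reduce the loss.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy training data generated from y = 2x + 1 (no noise).
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = train_linear_model(train_x, train_y)
```

On this noise-free toy data the learned parameters approach `w = 2` and `b = 1`, the values used to generate it.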

#### 3. Assessment 📈
This step evaluates how well the model performs.
* **Model Testing** involves using the learned model to make predictions on new, unseen test data: $\hat{y}=f_{\theta}(x_{test})$.
* **Performance metrics** (like accuracy) are used to assess the model's predictions against the actual values.

***

### Machine Learning Workflow Overview

The ML process is split into two main stages:

1.  **Training Stage**: A learning algorithm uses the **training data** ($\mathcal{D}_{train}$) to find a suitable predictive model ($Y=f_{\theta}(X)$).
2.  **Prediction Stage**: The trained model is given new **test data** ($x_{test}$) to generate a **predicted label** ($\hat{y}$), which is then evaluated for performance.
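The two stages can be sketched with a deliberately simple model. The example assumes a nearest-centroid classifier, chosen only because it fits in a few lines; the function names `train` and `predict` and the sample points are hypothetical.

```python
from typing import Dict, List, Tuple

Point = List[float]


def train(train_data: List[Tuple[Point, str]]) -> Dict[str, Point]:
    """Training stage: learn one centroid (mean point) per label."""
    sums: Dict[str, Point] = {}
    counts: Dict[str, int] = {}
    for x, y in train_data:
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(model: Dict[str, Point], x: Point) -> str:
    """Prediction stage: return the label whose centroid is closest to x."""

    def squared_distance(a: Point, b: Point) -> float:
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return min(model, key=lambda label: squared_distance(model[label], x))


# Made-up training data with two labels.
d_train = [
    ([1.0, 1.0], "small"),
    ([2.0, 1.0], "small"),
    ([8.0, 9.0], "large"),
    ([9.0, 8.0], "large"),
]
model = train(d_train)          # training stage
y_hat = predict(model, [7.5, 8.0])  # prediction stage on a new point
```

Here the learned "parameters" are just the two centroids, and the predicted label for the new point is the label of the nearer one.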



[Image of a machine learning workflow diagram]


***

### Types of Machine Learning

These notes cover two primary types of machine learning, distinguished by the kind of data they use.

#### Supervised Learning 🧑‍🏫

In supervised learning, the goal is to learn a function from **labeled training data** to predict an output for new, unlabeled input. The training data includes both input features and the correct output labels.

There are two main sub-types:

1.  **Classification**: The output is a discrete label or category.
    * **Binary Classification**: Separates inputs into one of two classes (e.g., "dog" or "not dog").
    * **Multinomial (Multi-class) Classification**: Separates inputs into one of multiple classes (e.g., "Australian shepherd," "golden retriever," or "poodle").
2.  **Regression**: The output is a continuous, real-valued quantity (e.g., predicting ice cream sales in dollars based on temperature).

Common supervised learning algorithms in Apache Spark include Linear Regression, Logistic Regression, Decision Trees, and Support Vector Machines (SVMs).

#### Unsupervised Learning 🕵️

In unsupervised learning, the goal is to discover underlying structures and patterns in **unlabeled data**.

The two main sub-types are:

1.  **Clustering**: Divides data into groups (clusters) where items in the same cluster are more similar to each other than to those in other clusters (e.g., grouping customers by purchasing habits).
2.  **Association**: Discovers rules that describe relationships between items in a large dataset (e.g., "people who buy X also tend to buy Y").

Common unsupervised learning algorithms in Apache Spark include k-means, Latent Dirichlet Allocation (LDA), and Gaussian Mixture Models.
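To make clustering concrete, here is a plain-Python sketch of k-means (Lloyd's algorithm). It is not Spark's implementation or API. It assumes 2-D points and, to stay deterministic, uses the first `k` points as initial centroids; real implementations usually initialize more carefully.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def k_means(
    points: List[Point], k: int, iterations: int = 10
) -> Tuple[List[Point], List[int]]:
    """Lloyd's algorithm with the first k points as initial centroids."""
    centroids = list(points[:k])
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, (px, py) in enumerate(points):
            distances = [(px - cx) ** 2 + (py - cy) ** 2 for cx, cy in centroids]
            assignments[i] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, assignments


# Made-up points forming two visibly separate groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (8.0, 9.0), (2.0, 1.0), (9.0, 8.0)]
centroids, assignments = k_means(points, k=2)
```

No labels are supplied anywhere: the grouping comes only from the distances between points.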

***

### Model Assessment and Performance

To assess a model, we need to prepare the data and measure its performance.

* **Data Preparation**:
    * **Train-Test Split**: The dataset is split into a training set (e.g., 80%) to build the model and a test set (e.g., 20%) to evaluate it on unseen data.
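    * **K-fold Cross-Validation**: A technique for getting a more reliable performance estimate than a single split. The data is divided into k folds; the model is trained on k-1 folds and tested on the remaining one, and this is repeated so that each fold serves as the test set once. The k scores are then averaged.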
* **Performance Metrics** (for classification):
    * These are calculated from the four cells of a **confusion matrix**: True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN).
    * **Accuracy**: $Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$
        * **Note**: The proportion of all predictions that were correct. It adds the correct positive predictions (TP) and correct negative predictions (TN) and divides by the total number of predictions.
    * **Precision**: $Precision = \frac{TP}{TP + FP}$
        * **Note**: The accuracy of the positive predictions. It divides the true positives (TP) by the total number of items predicted as positive (TP + FP).
    * **Recall (Sensitivity)**: $Recall = \frac{TP}{TP + FN}$
        * **Note**: How many of the actual positives were correctly identified. It divides the true positives (TP) by the total number of actual positive items (TP + FN).
    * **F1-Score**: $F_1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$
        * **Note**: This is the harmonic mean of Precision and Recall, providing a single score that balances both metrics.
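The four metrics can be computed directly from the confusion-matrix counts. The sketch below assumes binary labels encoded as 1 (positive) and 0 (negative); the function name `classification_metrics` and the sample labels are made up for illustration.

```python
from typing import Dict, List


def classification_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against division by zero when a denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

For these labels the counts are TP = 3, FN = 1, FP = 1, TN = 5, so accuracy is 8/10 while precision, recall, and F1 are each 3/4.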

***

### Bias vs. Variance Trade-off

Two key concepts in model performance are bias and variance.

* **Bias**: The gap between the model's average prediction and the actual value. **High bias** can cause an algorithm to miss relevant relations between features and target outputs (**underfitting**).
* **Variance**: How much the model's predictions change when it is trained on different samples of the training data. **High variance** can cause an algorithm to model the random noise in the training data (**overfitting**).

The ideal model has **low bias** and **low variance**.

* **Underfitting (High Bias, Low Variance)**: The model is too simple and does not perform well even on the training data.
* **Overfitting (Low Bias, High Variance)**: The model performs well on training data but generalizes poorly to new data.
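
For squared-error loss, the trade-off has a standard decomposition. The symbols here are assumptions that do not appear in the notes: the data follow $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, and $\hat{f}$ is the model learned from a randomly drawn training set. The expected error at a point $x$ then splits into three parts:

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}
$$

Making a model more flexible typically lowers the bias term and raises the variance term, which is why the two cannot usually be minimized independently.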

#### How to Prevent Overfitting

1.  **Train with more data**: More data can help the model learn the true underlying patterns instead of noise.
2.  **Remove features**: A simpler model with fewer features can be less prone to overfitting.
3.  **Early stopping**: Stop the training process before the model starts to overfit, typically when performance on a validation set stops improving and begins to degrade.
4.  **Cross-validation**: Use methods like k-fold cross-validation to check that the model generalizes across different subsets of the data, and to choose model settings on that basis.
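
As a rough sketch of how k-fold cross-validation partitions the data, the helper below yields index lists for each fold. The name `k_fold_indices` is hypothetical, and the sketch assumes the samples are already in random order; in practice the indices are usually shuffled first.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
```

Each sample lands in exactly one validation fold, so every data point is used for evaluation once and for training k-1 times.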

## Part 2: Featurization

### The ML Pipeline and Featurization

A typical ML pipeline involves:
1.  **Featurization**: Converting raw training data (e.g., text) into numerical **feature vectors** that algorithms can understand.
2.  **Training**: Feeding these vectors into a model to learn patterns.
3.  **Model Evaluation**: Testing different models to find the best one.



**Featurization** is the process of extracting, transforming, and selecting features. Defining the right features is often the most critical part of the ML process.

***

### Feature Extraction

This involves extracting features from "raw" data. For text data, common techniques include:

#### 1. Count Vectorizer
Converts a collection of text documents into vectors of token (word) counts. It selects the top words by frequency to create a vocabulary.
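
A minimal plain-Python sketch of the idea follows. It is not Spark's `CountVectorizer` API. Whitespace tokenization, lowercasing, and alphabetical tie-breaking are simplifying assumptions, and the documents are made up.

```python
from collections import Counter
from typing import List, Tuple


def count_vectorize(
    documents: List[str], vocab_size: int
) -> Tuple[List[str], List[List[int]]]:
    """Build a vocabulary of the most frequent tokens and count them per document."""
    tokenized = [doc.lower().split() for doc in documents]
    corpus_counts = Counter(token for tokens in tokenized for token in tokens)
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(corpus_counts.items(), key=lambda item: (-item[1], item[0]))
    vocabulary = [token for token, _ in ranked[:vocab_size]]
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One count per vocabulary entry; words outside the vocabulary are dropped.
        vectors.append([counts[token] for token in vocabulary])
    return vocabulary, vectors


docs = ["spark is fast", "spark runs on clusters", "fast clusters run spark"]
vocabulary, vectors = count_vectorize(docs, vocab_size=3)
```

With a vocabulary size of 3, only the three most frequent words in the corpus become vector positions.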

#### 2. TF-IDF (Term Frequency-Inverse Document Frequency)
A statistic that shows how important a word is to a document in a collection (corpus). It's calculated in two parts:

* **Term Frequency (TF)**: Measures how frequently a term appears in a document.
    * $TF(t,d)$
        * **Note**: The frequency of term $t$ in document $d$. This can be the raw count or, as in the practice question below, the count divided by the number of words in $d$.
* **Inverse Document Frequency (IDF)**: Measures how much information a word provides by looking at how common or rare it is across all documents.
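    * $IDF(t,D) = \log \frac{|D|+1}{DF(t,D)+1}$
        * **Note**: The rarer a term is across the corpus, the higher its IDF; a term that appears in every document gets an IDF of 0.
        * $t$: The term (word).
        * $D$: The corpus (the entire collection of documents).
        * $|D|$: The total number of documents in the corpus.
        * $DF(t,D)$: The number of documents that contain the term $t$.
        * The $+1$ in the numerator and denominator avoids division by zero for terms that appear in no document.
        * The logarithm dampens the effect of very large ratios, so extremely rare terms do not dominate.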
* **TF-IDF Score**: The product of the two values.
    * $TFIDF(t,d,D) = TF(t,d) \cdot IDF(t,D)$
        * **Note**: This final score combines the term frequency within a document with its inverse document frequency across the entire corpus. A high score means the term is frequent in a specific document but rare overall, making it a significant feature.

* **Practice Question**: Calculate the TF-IDF for the term “example”.
    * **Answer**: Using the formulas from the slides, with TF taken as the term count divided by the document length and a base-10 logarithm, if Document 1 is "this is a a sample" (5 words) and Document 2 is "this is another another example example example" (7 words):
        * **TF("example", d2)** = 3 / 7 ≈ 0.43.
        * **TF("example", d1)** = 0 / 5 = 0.
        * **DF("example", D)** = 1 (appears in one document).
        * **IDF("example", D)** = log₁₀((2+1)/(1+1)) = log₁₀(1.5) ≈ 0.176.
        * **TF-IDF("example", d2, D)** = (3/7) * 0.176 ≈ 0.075.
        * **TF-IDF("example", d1, D)** = 0 * 0.176 = 0.
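
The calculation can be checked with a short script. This sketch assumes the same conventions as the answer: TF is the term count divided by document length, and the logarithm is base 10. The two documents are written so that their lengths are 5 and 7 tokens, matching the word counts used above. Other implementations may use raw counts or the natural logarithm, which changes the numbers but not which terms rank highest.

```python
import math
from typing import List


def tf(term: str, document: List[str]) -> float:
    """Relative term frequency: occurrences of term divided by document length."""
    return document.count(term) / len(document)


def idf(term: str, corpus: List[List[str]]) -> float:
    """Smoothed inverse document frequency using a base-10 logarithm."""
    df = sum(1 for document in corpus if term in document)
    return math.log10((len(corpus) + 1) / (df + 1))


def tf_idf(term: str, document: List[str], corpus: List[List[str]]) -> float:
    """Product of term frequency and inverse document frequency."""
    return tf(term, document) * idf(term, corpus)


d1 = "this is a a sample".split()                               # 5 tokens
d2 = "this is another another example example example".split()  # 7 tokens
corpus = [d1, d2]

idf_example = idf("example", corpus)
score_d1 = tf_idf("example", d1, corpus)
score_d2 = tf_idf("example", d2, corpus)
```

The term scores zero in the first document because it never occurs there, and about 0.075 in the second.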

#### 3. Word2Vec
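Maps each word to a fixed-size numerical vector (an embedding) learned from the contexts in which the word appears. Words used in similar contexts end up with similar vectors, which lets downstream models capture semantic relationships between words.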
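
Similarity between word vectors is commonly measured with cosine similarity. The sketch below only illustrates that comparison: the three vectors are hand-written toy values, not the output of a trained Word2Vec model, and real embeddings typically have many more dimensions.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-written 3-dimensional vectors, NOT produced by a trained model.
toy_vectors: Dict[str, List[float]] = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
sim_cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])
```

In this toy setup "cat" and "dog" point in nearly the same direction while "car" does not, which is the kind of geometric structure a trained model is meant to produce.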

***

### Feature Transformation

This involves scaling, converting, or modifying features.

* **Tokenization**: Breaking text (like a sentence) into individual terms (tokens).
* **Stop Words Remover**: Removing common words (like "the", "a", "is") that appear frequently but don't carry much meaning.
* **String Indexing**: Converting a column of string labels into a column of numerical indices.
* **One-Hot Encoding (OHE)**: Maps a categorical feature index to a binary vector where a single '1' indicates the presence of a specific feature value.

* **Question**: Why can't we stop at String Indexing?
    * **Answer**: String Indexing assigns numerical values (e.g., JavaScript=0, Python=1, Scala=2). Machine learning algorithms might misinterpret these numbers as having an **ordinal relationship** (e.g., that Scala, at 2, is "greater" than Python, at 1), which is incorrect for categorical data without a natural order. One-Hot Encoding avoids this problem by giving each category its own binary vector, so no order is implied.
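
The four transformations above can be sketched in plain Python. This is an illustration, not Spark's API. The stop-word list is a tiny made-up one, and indexing by descending frequency with alphabetical tie-breaking is an assumption chosen for this sketch.

```python
from collections import Counter
from typing import Dict, List

STOP_WORDS = {"the", "a", "is"}  # tiny illustrative list, not a standard one


def tokenize(text: str) -> List[str]:
    """Split text into lowercase tokens on whitespace."""
    return text.lower().split()


def remove_stop_words(tokens: List[str]) -> List[str]:
    """Drop tokens that appear in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]


def string_index(labels: List[str]) -> Dict[str, int]:
    """Assign indices by descending frequency, ties broken alphabetically."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


def one_hot(index: int, size: int) -> List[int]:
    """Binary vector of length size with a single 1 at position index."""
    return [1 if position == index else 0 for position in range(size)]


tokens = remove_stop_words(tokenize("The model is a classifier"))

languages = ["JavaScript", "Python", "Scala", "Python", "JavaScript", "JavaScript"]
index_of = string_index(languages)
encoded = [one_hot(index_of[lang], len(index_of)) for lang in languages]
```

After one-hot encoding, every pair of distinct categories is equally far apart, so the model cannot read an ordering into the indices.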

***

### Feature Selection

This involves selecting a subset of the most important features to improve model performance and reduce complexity.

* **Vector Slicer**: A transformer (`VectorSlicer` in Apache Spark) that takes a feature vector and outputs a new vector containing only the features at the specified indices.
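
The underlying operation is simple index selection. The plain-Python sketch below shows the idea only and is not Spark's API; `slice_vector` is a hypothetical helper and the feature values are made up.

```python
from typing import List, Sequence


def slice_vector(features: Sequence[float], indices: Sequence[int]) -> List[float]:
    """Return the features at the given positions, in the order requested."""
    return [features[i] for i in indices]


feature_vector = [-2.0, 2.3, 0.0, 1.5]
selected = slice_vector(feature_vector, indices=[1, 3])
```

Here only the second and fourth features are kept, giving a shorter vector for the model to train on.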