## 1. Streaming K-means

In this problem, we consider how to extend the k-means algorithm to process streaming data. The standard k-means algorithm loads all data points into memory. In practice, data may arrive as a stream, in which each point is processed once and then discarded. The advantage of a streaming algorithm is that it can process data that does not fit into memory.

Suppose that there are $k$ clusters. The cluster centers are randomly initialized. Once the processor receives a data point $x \in \mathbb{R}^d$, it does the following: 1) Find the cluster whose center is the closest to $x$ (in Euclidean distance), then add $x$ to the cluster; 2) Adjust the cluster center so that it equals the mean of all cluster members. The algorithm outputs the $k$ cluster centers after processing all data points in the stream. According to the above algorithm specification, complete the streaming algorithm for k-means:

(a) List the variables that are stored in the memory and their initial values. Which variables should be the output of the algorithm?


<div style="color:blue">
    
To design a streaming version of the k-means algorithm, we must first identify the essential variables and their initial states. The primary objective is to minimize memory usage while processing the data points coming in a stream. Here's how we can set up the variables:

1. **Number of Clusters, $k$**: This is a predefined constant indicating the number of clusters to form.

2. **Cluster Centers, $C = \{c_1, c_2, ..., c_k\}$**: These are vectors in $\mathbb{R}^d$. Initially, they are randomly initialized. Each $c_i$ represents the center of the $i$-th cluster.

3. **Counters, $N = \{n_1, n_2, ..., n_k\}$**: A set of integers representing the number of data points in each cluster. Initially, $n_i = 0$ for all $i$.

4. **Sum of Points in Each Cluster, $S = \{s_1, s_2, ..., s_k\}$**: These are vectors in $\mathbb{R}^d$. Each $s_i$ represents the cumulative sum of all points in the $i$-th cluster. Initially, $s_i = \vec{0}$ (the zero vector in $\mathbb{R}^d$) for all $i$.

    
    
</div>

(b) When the processor receives a data point $x$, state the updates that are made on the variables.


<div style="color:blue">
    
When a new data point $x \in \mathbb{R}^d$ arrives:
  1. Find the closest cluster center to $x$. Denote this cluster as $j$, where $j = \text{argmin}_i \| x - c_i \|$.
  2. Update the sum for cluster $j$: $s_j = s_j + x$.
  3. Increment the counter for cluster $j$: $n_j = n_j + 1$.
  4. Update the center of cluster $j$: $c_j = \frac{s_j}{n_j}$.

The output of the algorithm after processing all data points in the stream would be the final cluster centers $C = \{c_1, c_2, ..., c_k\}$.

This approach ensures that the entire dataset does not need to be stored in memory, making it feasible for streaming data. Only the cluster centers, the count, and the sum of points in each cluster are retained.
    
</div>
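The updates in (a) and (b) can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the class name `StreamingKMeans`, the `seed` parameter, and the choice to sample initial centers from $[0, 1)$ are assumptions made here, not part of the problem statement.

```python
import random


class StreamingKMeans:
    """Per cluster, keep only a center c_i, a count n_i, and a running sum s_i."""

    def __init__(self, k, d, seed=0):
        rng = random.Random(seed)
        # Random initial centers (the sampling range [0, 1) is an assumption).
        self.centers = [[rng.random() for _ in range(d)] for _ in range(k)]
        self.counts = [0] * k                       # n_i = 0
        self.sums = [[0.0] * d for _ in range(k)]   # s_i = zero vector

    def update(self, x):
        # Step 1: closest center (squared distance has the same argmin).
        j = min(
            range(len(self.centers)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(x, self.centers[i])),
        )
        # Step 2: s_j = s_j + x
        self.sums[j] = [s + a for s, a in zip(self.sums[j], x)]
        # Step 3: n_j = n_j + 1
        self.counts[j] += 1
        # Step 4: c_j = s_j / n_j
        self.centers[j] = [s / self.counts[j] for s in self.sums[j]]
        return j


model = StreamingKMeans(k=2, d=2)
for point in [[1.0, 1.0], [3.0, 1.0], [9.0, 9.0]]:
    model.update(point)
print(model.centers)  # the algorithm's output after the stream ends
```

Memory use is $O(kd)$ regardless of the stream length, because each point is discarded after `update` returns.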


(c) In each iteration, suppose the processor receives a data point $x$ along with its weight $w > 0$. We want the cluster center to be the weighted average of all cluster members. How do you modify the updates in question (b) to process weighted data?


<div style="color:blue">
    
Steps 2 and 3 need to be updated. Also, $n_j$ now tracks the total weight of the points in cluster $j$, instead of the total number of points assigned to cluster $j$.
    
1. Find the closest cluster center to $x$. Denote this cluster as $j$, where $j = \text{argmin}_i \| x - c_i \|$.
2. Update the sum for cluster $j$: $s_j = s_j + wx$.
3. Increment the counter for cluster $j$: $n_j = n_j + w$.
4. Update the center of cluster $j$: $c_j = \frac{s_j}{n_j}$.

    
</div>
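As a cross-check on the weighted update, steps 2 to 4 can be collapsed into a single incremental form. In this sketch, $n_j'$ and $c_j'$ denote the values after the update (notation introduced here), and the derivation uses the invariant $s_j = n_j c_j$, which holds trivially when $n_j = 0$ and is preserved by every update:

$$
n_j' = n_j + w, \qquad
c_j' = \frac{s_j + w x}{n_j + w} = \frac{n_j c_j + w x}{n_j + w} = c_j + \frac{w}{n_j'} \left( x - c_j \right)
$$

Setting $w = 1$ recovers the unweighted update from (b). Under this form the sum $s_j$ would not need to be stored, since the center and the total weight carry the same information.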


## 2. Cars and Clusters

(a) Imagine $n$ cars, each of which travels at a different maximum speed. Initially, the cars are queued up in uniform random order at the starting point of a semi-infinite, one lane highway. Each car drives at the minimum of its maximum speed and the speed at which the car in front of it is driving. The cars will form 'clumps'/clusters. What is the expected number of clumps? Prove your answer.


<div style="color:blue">
    
**1. Define variables:**

- Let $n$ represent the number of cars.
- Let $C_n$ represent the expected number of clumps for $n$ cars.

**2. Establish a base case:**

- If there's only one car ($n = 1$), there's only one clump. So, $C_1 = 1$.

**3. Apply a recursive approach:**

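- Consider adding the $n$-th car at the back of a queue of $n-1$ cars already on the highway. Since the order is uniformly random, its speed is equally likely to have any rank among the $n$ speeds.
- The $n$-th car forms a new clump if it is slower than every car ahead of it (it falls behind); otherwise it catches up with the clump in front and joins it. Cars behind never affect cars ahead, so the first $n-1$ cars still form $C_{n-1}$ clumps in expectation.
- The probability of it being the slowest of the $n$ cars is $\frac{1}{n}$.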

**4. Formulate the recursive relation:**

$C_n = C_{n-1} + \frac{1}{n}$

**5. Solve the recursion:**

- Expand the recursion:

$C_n = C_{n-2} + \frac{1}{n-1} + \frac{1}{n} = C_{n-3} + \frac{1}{n-2} + \frac{1}{n-1} + \frac{1}{n} = ... = C_1 + \frac{1}{2} + \frac{1}{3} + ... + \frac{1}{n}$

- Substitute $C_1 = 1$:

$C_n = 1 + \frac{1}{2} + \frac{1}{3} + ... + \frac{1}{n}$

**6. Recognize the harmonic series:**

- The sum $1 + \frac{1}{2} + \frac{1}{3} + ... + \frac{1}{n}$ is the $n$-th harmonic number, denoted as $H_n$.

**7. Conclusion:**

- The expected number of clumps is $C_n = H_n$, the $n$-th harmonic number.

</div>
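The result $C_n = H_n$ can be checked exactly for small $n$ by enumerating every queue order. The sketch below assumes the convention that index 0 is the front of the queue; the function names are chosen here for illustration.

```python
from fractions import Fraction
from itertools import permutations


def count_clumps(speeds):
    """speeds[0] is the front car. A car leads a new clump
    iff it is slower than every car ahead of it."""
    clumps = 0
    slowest_ahead = float("inf")
    for s in speeds:
        if s < slowest_ahead:
            clumps += 1
            slowest_ahead = s
    return clumps


def expected_clumps(n):
    """Exact average over all n! equally likely queue orders."""
    orders = list(permutations(range(n)))
    return Fraction(sum(count_clumps(p) for p in orders), len(orders))


def harmonic(n):
    return sum(Fraction(1, i) for i in range(1, n + 1))


for n in range(1, 7):
    print(n, expected_clumps(n), harmonic(n))  # the two columns agree
```

Exact rational arithmetic avoids any sampling noise, at the cost of only being practical for small $n$.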

(b) Consider the following random graph model with clustering. For n nodes, we have $P(n, 3)$ distinct ‘triplets’. For each triplet, with independent probability $p$ we connect the nodes belonging to this triplet in the graph using three edges to form a triangle, where 

$$p = \frac{c}{P(n-1, 2)}$$

$c$ is a constant. Assume $n$ is very large. Show that the expected degree of a node in this graph model is $2c$.

<div style="color:blue">

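Reading $P(n, k)$ as the number of $k$-element subsets of $n$ nodes, we have $P(n, 3) / P(n-1, 2) = n/3$. Each selected triplet adds three edges, and for large $n$ two triangles rarely share an edge, so

$$ E[\text{Total \# Edges}] = P(n, 3) \times p \times 3 = \frac{cn}{3} \times 3 = cn$$

Every edge contributes to the degree of two nodes, so

$$ E[\text{Node Degree}] = 2 \times E[\text{Total \# Edges}] / n = 2c$$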

</div>
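The arithmetic can be verified with exact fractions. This sketch assumes that $P(n, k)$ means the binomial coefficient (`math.comb`) and counts triangle edges with multiplicity, that is, it ignores the rare case where two triangles share an edge.

```python
from fractions import Fraction
from math import comb


def expected_degree(n, c):
    p = Fraction(c, comb(n - 1, 2))       # probability that a triplet becomes a triangle
    expected_edges = comb(n, 3) * p * 3   # triplets x probability x 3 edges each
    return 2 * expected_edges / n         # each edge touches two nodes


print(expected_degree(1000, 4))  # prints 8, i.e. 2c
```

An equivalent per-node view: a given node belongs to $P(n-1, 2)$ triplets, and each resulting triangle adds 2 to its degree, giving $P(n-1, 2) \times p \times 2 = 2c$.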

## 3. Support Vector Machine

Given 2-dimensional input data points $S_1 = \{(1, 4), (1, 5), (2, 4), (2, 5), (3, 3)\}$ and $S_2 = \{(3, 2), (3, 1), (4, 1), (5, 1), (6, 1), (6, 2)\}$, where $S_1$ has the data points from the positive class and $S_2$ has data points from the negative class:

(a) Suppose you are using a linear SVM with no provision for noise (i.e. a Linear SVM that is trying to maximize its margin while ensuring all data points are on their correct sides of the margin). Draw three lines on the above diagram, showing classification boundary and the two sides of the margin. Circle the support vector(s).

<div style="color:blue">
    
</div>

(b) Using the familiar Linear SVM classifier notation of the classifier $\text{sign}(w^T x + b)$, calculate the values of $w$ and $b$ learned for part (a).

<div style="color:blue">
    
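The closest pair of opposite-class points is $(3, 3)$ and $(3, 2)$, so the maximum-margin boundary is the horizontal line $x_2 = 2.5$ (where $x_2$ is the second coordinate), with margin lines $x_2 = 3$ and $x_2 = 2$. Scaling so that the points on the margin satisfy $w^T x + b = \pm 1$, the learned parameters using the Linear SVM classifier are:

$w = (0, 2)$, $b = -5$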

So the classifier can be described by the equation $\text{sign}(w^T x + b)$, which in this case is $\text{sign}([0, 2] \cdot x - 5)$.
    
    
</div>
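A quick sanity check of these values: the sketch below confirms that $w = (0, 2)$, $b = -5$ satisfies every hard-margin constraint $y_i (w^T x^i + b) \geq 1$ and lists the points where the constraint is tight. It verifies feasibility and tightness only; it does not solve the optimization problem.

```python
S1 = [(1, 4), (1, 5), (2, 4), (2, 5), (3, 3)]           # positive class
S2 = [(3, 2), (3, 1), (4, 1), (5, 1), (6, 1), (6, 2)]   # negative class
w, b = (0, 2), -5


def functional_margin(x, y):
    """y * (w^T x + b); must be >= 1 for every training point."""
    return y * (w[0] * x[0] + w[1] * x[1] + b)


data = [(x, +1) for x in S1] + [(x, -1) for x in S2]
margins = {x: functional_margin(x, y) for x, y in data}

# Points lying exactly on a margin line (constraint holds with equality).
on_margin = [x for x, m in margins.items() if m == 1]
margin_width = 2 / (w[0] ** 2 + w[1] ** 2) ** 0.5

print(min(margins.values()))  # 1
print(on_margin)              # [(3, 3), (3, 2), (6, 2)]
print(margin_width)           # 1.0
```

The points on the margin lines, $(3, 3)$, $(3, 2)$ and $(6, 2)$, are the candidates to circle in part (a).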


(c) Assume you are using a noise-tolerant Linear SVM which tries to optimize 

$$
\begin{gathered}
\min _{w, b, \epsilon} \frac{1}{2} \boldsymbol{w}^T \boldsymbol{w}+C \Sigma_i \epsilon^i \\
\text { s.t. } y_i\left(\boldsymbol{w}^T \boldsymbol{x}^i+b\right) \geq 1-\epsilon^i, \epsilon^i \geq 0, \forall i
\end{gathered}
$$

Question: is it possible to invent a dataset and a positive value of $C$ in which the dataset is linearly separable but the linear SVM classifier would nevertheless misclassify at least one training point? If it is possible to invent such an example, please sketch the example and suggest a value for $C$. If it is not possible, explain why not.


<div style="color:blue">
    
It is possible.

When $C$ is large, the SVM model tries to minimize the slack variables $\epsilon^i$ as much as possible to reduce misclassification. In a linearly separable case, it is possible to achieve $\epsilon^i = 0$ for all $i$, leading to no misclassification.

A small value of $C$ makes the SVM more tolerant to violations (misclassifications). Even if the data is linearly separable, the model might choose to misclassify some points if it leads to a simpler decision boundary (smaller $\boldsymbol{w}$). This happens because the penalty for misclassification is lower, and the optimization might find a solution where the gain in simplicity of the model (smaller $\boldsymbol{w}$) outweighs the cost of misclassification.

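**Example**:
- Consider a one-dimensional dataset with a negative point at $x = -1$ and positive points at $x = -0.99$ and $x = 1$. It is linearly separable, but the gap between the classes is only $0.01$.
- A separator with zero slack needs $|w| \geq 200$, so its objective is at least $\frac{1}{2} \cdot 200^2 = 20000$.
- With a very small value such as $C = 0.01$, the choice $w = 0$, $b = 1$ predicts every point as positive: it misclassifies the negative point with slack $\epsilon = 2$, yet its objective is only $2C = 0.02$. The optimizer therefore prefers a smaller norm of $\boldsymbol{w}$ (a wider margin) over classifying every training point correctly.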
    
</div>
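The trade-off can be illustrated numerically. The sketch below uses a hypothetical 1D dataset (a negative point at $-1$, positive points at $-0.99$ and $1$) with $C = 0.01$, and compares the zero-slack separator against the best point found by a coarse grid search. The grid is an assumption made for illustration and is not a real SVM solver.

```python
C = 0.01
# Hypothetical 1D dataset: separable, but the gap between classes is only 0.01.
data = [(-1.0, -1), (-0.99, +1), (1.0, +1)]


def objective(w, b):
    """Soft-margin objective with the slacks set to their optimal (hinge) values."""
    slack = sum(max(0.0, 1 - y * (w * x + b)) for x, y in data)
    return 0.5 * w * w + C * slack


def misclassified(w, b):
    return [x for x, y in data if y * (w * x + b) <= 0]


# Zero-slack separator: boundary at x = -0.995 with both close points on the margin.
hard = (200.0, 199.0)

# Coarse grid over w in [0, 1] and b in [-2, 2], step 0.01.
grid = [(i / 100, j / 100) for i in range(0, 101) for j in range(-200, 201)]
best = min(grid, key=lambda wb: objective(*wb))

print(objective(*hard))       # about 20000
print(best, objective(*best)) # (0.0, 1.0) 0.02
print(misclassified(*best))   # [-1.0]
```

Values of $w$ outside the grid cannot do better here, since any $|w| > 1$ already costs more than $0.5$ through the $\frac{1}{2} w^2$ term alone.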

## 4. Decision Tree Classifier

Alice is a cyber analyst designing a binary classifier to detect network intrusions at a large technology company. She is considering using a decision tree (classification tree) for this task.


(a) In Alice’s context, what would the positive class typically refer to?

<div style="color:blue">


If we let $y$ be the class label, then:

- $y = 1$ (positive class) implies an intrusion is detected.
- $y = 0$ (negative class) implies normal activity, no intrusion.
    
</div>

(b) Alice is considering three common approaches to measure her tree’s classification error. Briefly describe each approach, and state at least one drawback for each approach.

* i. Misclassification rate 
* ii. Average loss
* iii. Normalized negative log-likelihood (or cross-entropy)


<div style="color:blue">
    

i. **Misclassification Rate**

The misclassification rate is a straightforward measure of error. It is the proportion of the total number of predictions that were incorrect.

$$\text{Misclassification Rate} = \frac{\text{Number of incorrect predictions}}{\text{Total number of predictions}}$$

_Drawback_: It treats all errors equally, which might not be appropriate in contexts where the cost of false positives and false negatives is different. For example, in network intrusion detection, a false negative (failing to detect an intrusion) can be more dangerous than a false positive (wrongly detecting an intrusion).

ii. **Average Loss**

Average loss involves calculating the average of the loss function across all predictions. The loss function quantifies how far the predictions are from the actual values.

$$\text{Average Loss} = \frac{1}{N} \sum_{i=1}^{N} L(y_i, \hat{y}_i)$$

where $L(y_i, \hat{y}_i)$ is the loss function, $y_i$ is the true label, $\hat{y}_i$ is the predicted label, and $N$ is the number of samples.

_Drawback_: The choice of loss function can significantly affect the results, and some loss functions might not be suitable for all kinds of data, especially if the data is imbalanced.

iii. **Normalized Negative Log-Likelihood (or Cross-Entropy)**

This approach, often used in logistic regression, measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label.

$$\text{Cross-Entropy} = - \frac{1}{N} \sum_{i=1}^{N} [y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i)]$$

where $\hat{p}_i$ is the predicted probability of the observation belonging to the positive class.

_Drawback_: It assumes the outputs of the model are probabilities, which might not always be the case, especially for some types of classifiers. Also, it can be heavily impacted by outliers or mislabeled instances.
    
</div>
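The three measures can be compared on a toy example. The labels, predicted probabilities, 0.5 threshold, and the asymmetric costs (`cost_fn`, `cost_fp`) below are all hypothetical values chosen for illustration.

```python
import math

y_true = [1, 0, 1, 0]           # hypothetical labels: 1 = intrusion, 0 = normal
p_hat = [0.9, 0.2, 0.4, 0.6]    # hypothetical predicted P(intrusion)
y_pred = [1 if p >= 0.5 else 0 for p in p_hat]


def misclassification_rate(y, yhat):
    return sum(a != b for a, b in zip(y, yhat)) / len(y)


def average_loss(y, yhat, cost_fn=5.0, cost_fp=1.0):
    # Assumed cost-sensitive loss: a missed intrusion costs more than a false alarm.
    def loss(a, b):
        if a == b:
            return 0.0
        return cost_fn if a == 1 else cost_fp
    return sum(loss(a, b) for a, b in zip(y, yhat)) / len(y)


def cross_entropy(y, p):
    total = sum(a * math.log(q) + (1 - a) * math.log(1 - q) for a, q in zip(y, p))
    return -total / len(y)


print(misclassification_rate(y_true, y_pred))  # 0.5
print(average_loss(y_true, y_pred))            # 1.5
print(cross_entropy(y_true, p_hat))            # about 0.54
```

The same two errors count equally under the misclassification rate, but the missed intrusion dominates the average loss, and cross-entropy additionally depends on how confident each prediction was.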

(c) Alice is considering using a ROC (receiver operating characteristic) curve to visualize her classifier’s performance. Her colleague Bob suggests she use AUC (area under an ROC curve) to summarize each ROC into a single AUC value instead, so the AUC values may be more easily compared.

i. Briefly explain why Bob’s suggestion of using AUC may be problematic.


<div style="color:blue">

Imbalanced Classes: If the data is highly imbalanced (i.e., the number of instances in one class significantly outweighs the number of instances in the other), the AUC might present an overly optimistic view of the model's performance. The false positive rate is normalized by the large number of negatives, so a classifier can achieve a high AUC even when most of its alarms are false positives (low precision).

Lack of Discrimination Between Distributions: AUC measures the model's ability to rank predictions rather than its accuracy. It doesn't distinguish between the types of errors (false positives and false negatives). Two models could have the same AUC but very different distributions of false positives and false negatives, which is crucial in applications like intrusion detection where the cost of false negatives might be much higher than false positives.

No Insight into Threshold Selection: AUC doesn't provide information about the optimal threshold for classification. The ROC curve itself is a plot of the true positive rate against the false positive rate at various threshold settings, but the AUC summarizes this into one metric, losing information about which threshold might be the best for a specific operational context.

Scale Insensitivity: AUC is scale-invariant, meaning it measures how well predictions are ranked rather than their absolute values. This could be a problem if the absolute values of the predictions have practical significance, which the AUC doesn't capture.
    
</div>

ii. Alice finds that one of her trees has an AUC score of 0. Her colleague Bob notices this and is very happy with the score — why?

<div style="color:blue">
    
AUC values range from 0 to 1:

- 0.5 indicates a model with no discriminative ability (i.e., it's as good as random guessing).
- 1 indicates perfect classification.
- 0 indicates the worst possible classification.

An AUC score of 0 would mean that the classifier consistently classifies positive instances as negative and negative instances as positive. By simply inverting its decisions (considering its negative predictions as positive and vice versa), the model would achieve an AUC score of 1, implying perfect classification.
    
</div>
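The inversion argument can be demonstrated with a small sketch. It computes AUC through its pairwise-ranking interpretation (the fraction of positive/negative pairs in which the positive receives the higher score, with ties counted as one half); the labels and scores are hypothetical.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0]
scores = [0.1, 0.2, 0.8, 0.9]        # every positive is ranked below every negative
flipped = [1 - s for s in scores]    # invert the classifier's scores

print(auc(labels, scores))   # 0.0
print(auc(labels, flipped))  # 1.0
```

In general, flipping the scores maps an AUC of $a$ to $1 - a$, which is why a score of 0 is as informative as a score of 1.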