# DSC 255: Machine Learning

## Homework 1

### Nearest Neighbor Classification

1. *Casting an image into vector form.* A $10 \times 10$ greyscale image is mapped to a $d$-dimensional vector, with one pixel per coordinate. What is $d$?

   **Solution:** 

   Given: $l = 10$ ; $w = 10$ ; $d = l \times w$

   *Apply $d = l \times w$ and substitute $l$ and $w$*

      $d = l \times w$

      $d = 10 \times 10$

      $d = 100$
   
   $\therefore$ the $10 \times 10$ greyscale image is mapped to a $100$-dimensional vector
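
The mapping above can be sketched in code. This is a minimal illustration, assuming the image is stored as a list of rows; the helper name `flatten_image` and the all-zero pixel values are hypothetical and not part of the assignment.

```python
def flatten_image(image):
    """Concatenate the rows of a 2-D image into one flat list, one pixel per coordinate."""
    return [pixel for row in image for pixel in row]


# Hypothetical 10 x 10 greyscale image with every pixel set to 0.
image = [[0] * 10 for _ in range(10)]
vector = flatten_image(image)

print(len(vector))  # 100
```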

2. *The length of a vector.* The Euclidean (or $L_2$) length of a vector $x \in \mathbb{R}^d$ is

   $$
   \|x\| = \sqrt{\sum_{i=1}^{d} x_i^2}
   $$

   where $x_i$ is the $i$-th coordinate of $x$. This is the same as the Euclidean distance between $x$ and the origin. What is the length of the vector which has a 1 in every coordinate? Your answer may be a function of $d$.

   **Solution:** 

   Given: $x \in \mathbb{R}^d$ with $x_i = 1$ for all $i \in \{1, \dots, d\}$

   It follows that,

   $\|x\| = \sqrt{\sum_{i=1}^{d} x_i^2}$ $\rightarrow$ *Definition of $L_2$ length*

   $\|x\| = \sqrt{\left(x_1^{2}+x_2^{2}+\dots+x_d^{2}\right)}$ $\rightarrow$ *Expand summation*

   $\|x\| = \sqrt{\left(1^{2}+1^{2}+\dots+1^{2}\right)}$ $\rightarrow$ $x_i = 1$ for all $i \in \{1, \dots, d\}$

   $\|x\| = \sqrt{\left(1+1+\dots+1\right)}$ $\rightarrow$ $1^2 = 1$

   $\|x\| = \sqrt{d}$ $\rightarrow$ $\left(1+1+\dots+1\right) = d$, *since there are $d$ ones*

   $\therefore$ the length of the vector which has a 1 in every coordinate is $\|x\| = \sqrt{d}$
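
As a quick numerical check of this result, the sketch below evaluates the $L_2$ length of the all-ones vector for a few values of $d$. The function name `l2_norm` and the chosen values of $d$ are illustrative assumptions.

```python
import math


def l2_norm(x):
    """Euclidean (L2) length: square root of the sum of squared coordinates."""
    return math.sqrt(sum(xi ** 2 for xi in x))


# Compare the computed length of the all-ones vector with sqrt(d).
for d in (1, 4, 9, 100):
    ones = [1] * d
    print(d, l2_norm(ones), math.sqrt(d))
```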

3. *Euclidean distance.* What is the Euclidean distance between the following two points in $\mathbb{R}^3$?

   $$
   \begin{bmatrix}
   1 \\
   2 \\
   3
   \end{bmatrix},
   \begin{bmatrix}
   3 \\
   2 \\
   1
   \end{bmatrix}
   $$

   **Solution:** 

   Let, $x = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}$ , $y = \begin{bmatrix} 3 \\ 2 \\ 1 \end{bmatrix}$

   It follows that,
   $x_1 = 1$ , $x_2 = 2$ , $x_3 = 3$ , $y_1 = 3$ , $y_2 = 2$ , $y_3 = 1$

   The Euclidean distance in an $n$-dimensional space is defined as follows:

   $$\|x-y\| = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

   $x$ and $y$ are 3-dimensional vectors. Hence, $n=3$ and the equation above becomes:

   $$\|x-y\| = \sqrt{\sum_{i=1}^{3} (x_i - y_i)^2}$$

   After expanding the summation and substituting $x_1, \dots, y_3$, the Euclidean distance is:

   $\|x-y\| = \sqrt{\left((1 - 3)^2+(2 - 2)^2+(3 - 1)^2\right)}$

   $\|x-y\| = \sqrt{\left(4+0+4\right)}$

   Hence, $\|x-y\| = \sqrt{8} = 2\sqrt{2} \approx 2.83$

   $\therefore$ the Euclidean distance between $x = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}$ , $y = \begin{bmatrix} 3 \\ 2 \\ 1 \end{bmatrix}$ in $\mathbb{R}^3$ is $\|x-y\| = \sqrt{8}$
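
The same computation can be written as a short function. This is only a sketch; the name `euclidean_distance` is an assumption, and the length check is an added safeguard rather than part of the problem.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two points with the same number of coordinates."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same dimension")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


x = [1, 2, 3]
y = [3, 2, 1]
distance = euclidean_distance(x, y)

print(distance)  # sqrt(8), about 2.828
```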

4. *Accuracy of a random classifier.* A particular data set has 4 possible labels, with the following frequencies:
   
   <center>

   | Label | Frequency |
   | :---: | :-------: |
   | $A$   | $50\%$    |
   | $B$   | $20\%$    |
   | $C$   | $20\%$    |
   | $D$   | $10\%$    |

   </center>
 

   (a) What is the error rate of a classifier that picks a label $(A, B, C, D)$ at random, each with probability $\frac{1}{4}$?

   **Solution:** 

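   The classifier is correct on a point with true label $E_i$ only when it also picks $E_i$. Because the pick is independent of the true label, the probability of that joint event is:

   $$P(E_i) = f_i \times p_i$$
      - $f_i$ is the frequency of label $E_i$ in the data set
      - $p_i$ is the probability that the classifier picks label $E_i$
   
   These events are mutually exclusive (each point has a single true label), so the probability that either event E or event F occurs is the sum:
   $$P(E,F) = P(E)+P(F)$$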

   First, solve for $P(A)$ , $P(B)$ , $P(C)$ , $P(D)$.

   Given: $p_a = p_b = p_c = p_d = \frac{1}{4} = 0.25$ and $f_a = 0.5$ , $f_b = 0.2$ , $f_c= 0.2$ , $f_d = 0.1$  

   $$P(A) = f_a \times p_a = 0.5 \times 0.25 = 0.125$$

   $$P(B) = f_b \times p_b = 0.2 \times 0.25 = 0.050$$

   $$P(C) = f_c \times p_c = 0.2 \times 0.25 = 0.050$$

   $$P(D) = f_d \times p_d = 0.1 \times 0.25 = 0.025$$

   Hence, $P(A)=0.125$ , $P(B)=0.05$ , $P(C)=0.05$ , $P(D)=0.025$

   Now, solve for $P(A,B,C,D)$.

   $$P(A,B,C,D) = P(A) + P(B) + P(C) + P(D) = 0.125+0.05+0.05+0.025 = 0.25$$

   Hence, the accuracy, $P(A,B,C,D)$, is $0.25$.

   The error rate ($ER$) is defined as: $ER = 1 - \text{accuracy}$

   Lastly, solve for the error rate.

   $$ER = 1 - \text{accuracy} = 1 - 0.25 = 0.75$$

   $\therefore$ the error rate of a classifier that picks a label $(A, B, C, D)$ at random, each with probability $\frac{1}{4}$, is $75\%$.
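
The arithmetic for part (a) can be reproduced exactly with rational numbers. The sketch below is one possible check; the variable names are assumptions, and `Fraction` is used only to avoid floating-point rounding in the sum.

```python
from fractions import Fraction

# Label frequencies from the table, written as exact fractions.
frequencies = {
    "A": Fraction(1, 2),
    "B": Fraction(1, 5),
    "C": Fraction(1, 5),
    "D": Fraction(1, 10),
}
pick_probability = Fraction(1, 4)  # each label is picked with probability 1/4

# The classifier is correct when its random pick matches the true label.
accuracy = sum(f * pick_probability for f in frequencies.values())
error_rate = 1 - accuracy

print(accuracy, error_rate)  # 1/4 3/4
```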

   (b) One very simple type of classifier just returns the same label, always.
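   - What label should it return?

      **Solution:** 

      The classifier is correct exactly when the true label matches the label it always returns, so its accuracy equals the frequency of that label. Label $A$ has the highest frequency ($50\%$), so it should always return $A$.

   - What will its error rate be?

      **Solution:** 

      $$ER = 1 - f_a = 1 - 0.5 = 0.5$$

      $\therefore$ the error rate of the classifier that always returns $A$ is $50\%$.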

5. In the picture below, there are nine training points, each with label either square or star. These will be used to guess the label of a query point at $(3.5, 4.5)$, indicated by a circle.
<center>

   ![Training Points](dsc_255_hw1_5.png )

</center>

   Suppose Euclidean distance is used.

   (a) How will the point be classified by 1-NN? The options are square, star, or ambiguous.

   (b) By 3-NN?

   (c) By 5-NN?

6. We decide to use 4-fold cross-validation to figure out the right value of $k$ to choose when running $k$-nearest neighbor on a data set of size 10,000. When checking a particular value of $k$, we look at four different training sets. What is the size of each of these training sets?
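
The fold sizes in this setup can be tabulated with a small helper. This is a sketch under the usual assumption that the data are split into folds that are as equal as possible; the function name `cross_validation_sizes` is hypothetical.

```python
def cross_validation_sizes(n, folds):
    """Return a (training size, validation size) pair for each fold of n points."""
    base, extra = divmod(n, folds)
    sizes = []
    for fold in range(folds):
        # The first `extra` folds hold one additional point when n is not divisible.
        validation = base + (1 if fold < extra else 0)
        sizes.append((n - validation, validation))
    return sizes


sizes = cross_validation_sizes(10_000, 4)
print(sizes)  # [(7500, 2500), (7500, 2500), (7500, 2500), (7500, 2500)]
```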

7. An extremal type of cross-validation is $n$-fold cross-validation on a training set of size $n$. If we want to estimate the error of $k$-NN, this amounts to classifying each training point by running $k$-NN on the remaining $n-1$ points, and then looking at the fraction of mistakes made. It is commonly called leave-one-out cross-validation (LOOCV).

   Consider the following simple data set of just four points:
<center>

   ![Simple Data Set](dsc_255_hw1_7.png)

</center>

   What is the LOOCV error for 1-NN? For 3-NN?
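
The LOOCV procedure described above can be sketched as follows. The data set in this example is hypothetical (two tight pairs of points chosen for illustration) and is not the one shown in the figure; the function names are also assumptions.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Return the majority label among the k training points nearest to query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    # Tied votes are not handled specially in this sketch.
    return votes.most_common(1)[0][0]


def loocv_error(data, k):
    """Classify each point using all the others; return the fraction of mistakes."""
    mistakes = 0
    for i, (point, label) in enumerate(data):
        remaining = data[:i] + data[i + 1:]
        if knn_predict(remaining, point, k) != label:
            mistakes += 1
    return mistakes / len(data)


# Hypothetical data: NOT the four points in the figure.
data = [((0.0, 0.0), "a"), ((1.0, 0.0), "a"), ((5.0, 5.0), "b"), ((6.0, 5.0), "b")]

print(loocv_error(data, k=1))  # 0.0
print(loocv_error(data, k=3))  # 1.0
```

With this toy data, 1-NN always finds the same-label partner, while 3-NN must vote over the one remaining same-label point and the two points of the other label, so every held-out point is misclassified.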

### Programming Exercises

Before attempting this problem, make sure that Python 3 and Jupyter are installed on your computer.

8. **Nearest neighbor on MNIST.** For this problem, download the archive `hw1.zip`, available from the course website, and open it. The Jupyter notebook `nn-mnist.ipynb` implements a basic 1-NN classifier for a subset of the MNIST data set. It uses a separate training and test set. Begin by going through this notebook, running each segment and taking care to understand exactly what each line is doing.

   Now do the following:

   (a) For test point 100, print its image as well as the image of its nearest neighbor in the training set. Put these images in your writeup. Is this test point classified correctly?

   (b) The confusion matrix for the classifier is a $10 \times 10$ matrix $N_{ij}$ with $0 \leq i, j \leq 9$, where $N_{ij}$ is the number of test points whose true label is $i$ but which are classified as $j$. Thus, if all test points are correctly classified, the off-diagonal entries of the matrix will be zero.

   - Compute the matrix $N$ for the 1-NN classifier and print it out.
   - Which digit is misclassified most often? Least often?

   (c) For each digit $0 \leq i \leq 9$: look at all training instances of image $i$, and compute their mean. This average is a 784-dimensional vector. Use the `show_digit` routine to print out these 10 average-digits.
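
The confusion matrix in part (b) can be built from two lists of labels. The sketch below does not use the variables from `nn-mnist.ipynb`; the label lists, the four-class setting, and the function names are hypothetical and serve only to show the counting.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes=10):
    """N[i][j] counts test points whose true label is i and predicted label is j."""
    N = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        N[t][p] += 1
    return N


def misclassified_per_class(N):
    """Off-diagonal row sums: how many points of each true class were misclassified."""
    return [sum(row) - row[i] for i, row in enumerate(N)]


# Hypothetical labels for a 4-class toy problem (not MNIST results).
true_labels = [0, 1, 2, 2, 3, 3, 3]
predicted_labels = [0, 1, 2, 3, 3, 3, 2]

N = confusion_matrix(true_labels, predicted_labels, num_classes=4)
for row in N:
    print(row)
print(misclassified_per_class(N))  # [0, 0, 1, 1]
```

Comparing the off-diagonal row sums across digits is one way to see which digit is misclassified most and least often.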


##### ML Classification: Interpretation
- *ML Classification: Do classes and subclasses in the SDSS dataset have unique signatures composed of combinations of numerical variables?*
- *How well do machine learning classification algorithms predict the class and subclass categorical variables? Is there a difference in model performance when using calculated features versus only the values included in the original dataset? From the best-performing model, can we predict which subclass the Mystery QSO and Mystery Galaxy belong to?*
- Accuracy results for the K-Nearest Neighbors (KNN), Random Forest, and Support Vector Machine (SVM) models ranged from ~76.9% to 98.6%. The Random Forest model consistently outperformed the SVM and KNN models. The calculated features, color index (B-V) and distance, had the largest impact on the KNN model's accuracy, which increased by 3.6% when they were incorporated. These accuracy results indicate that classes and subclasses can be classified using the numerical features in this dataset.

##### ML Classification: Bias
- The calculated variables are only estimates and could artificially inflate the model accuracy results. Model selection was influenced by the results in [Janga et al. 2023](https://www.mdpi.com/2072-4292/15/16/4112), and it is possible that better models exist for this application.
