# 3.1: Intro and kNN Classifiers

* Labeled data: the dataset has an attribute (column/variable that is typically called the class attribute) that you are interested in *predicting* for *unseen* instances (rows)
* If you have labeled data, then you can do **supervised machine learning**
* If your attribute is categorical, then the supervised ML is a *classification task*
* If your attribute is numeric, then the supervised ML is a *regression task*
* Unlabeled data: the dataset has no such attribute you want to predict
* If you have unlabeled data, then you can do **unsupervised machine learning** (covered in the last unit of the course)
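To make the vocabulary above concrete, here is a minimal sketch that separates the class attribute from a small labeled table and guesses the task type from the label values. The table, the column names, and the helpers `split_features_and_labels()` and `task_type()` are hypothetical and only for illustration. Note that the numeric check is a rough heuristic: numeric codes can still stand for categories.

```python
# Hypothetical toy table: each row is an instance, each column an attribute.
header = ["height_cm", "weight_kg", "sport"]
table = [
    [180, 75, "basketball"],
    [165, 60, "soccer"],
    [190, 95, "basketball"],
]


def split_features_and_labels(table, header, class_name):
    """Separate the class attribute (labels) from the remaining attributes."""
    class_index = header.index(class_name)
    X = [row[:class_index] + row[class_index + 1:] for row in table]
    y = [row[class_index] for row in table]
    return X, y


def task_type(y):
    """Heuristic: all-numeric labels suggest regression, otherwise classification."""
    is_numeric = all(
        isinstance(value, (int, float)) and not isinstance(value, bool)
        for value in y
    )
    return "regression" if is_numeric else "classification"


X, y = split_features_and_labels(table, header, "sport")
print(y, task_type(y))  # categorical class attribute

_, y_numeric = split_features_and_labels(table, header, "weight_kg")
print(y_numeric, task_type(y_numeric))  # numeric class attribute
```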

## The Game Plan for the Rest of the Course

**SUPERVISED ML ALGORITHMS**

* Simple linear regression classifier (regression + discretization)
* Simple kNN classifier
* Simple dummy classifier
* Naive Bayes classifier (PA6)
* Decision tree classifier (PA7)
* Random forest classifier (ensemble method) (project)


**UNSUPERVISED ML ALGORITHMS**

* Association rule miner (PA8)
* $k$-means clustering (PA9) $\to$ possibly a bonus

## Regression Example

Linear regression follows this equation:

$$
y = mx + b
$$
* $m$ = slope
* $b$ = y-intercept
* $m$ and $b$ are "learned" from a training set
* you evaluate how good that model is on a testing set

TEST SET: `[150]`  
* This means that we are predicting $y$ for $x = 150$  
* $\hat y = m(150) + b$  
* $y_{actual} = 300$  
* $error = \hat y - y_{actual}$
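As a rough illustration of how $m$ and $b$ can be "learned" and then checked against the test instance, the sketch below fits a least squares line to a tiny training set and computes the error for $x = 150$. The training data is made up (noise-free points on $y = 2x$) and `fit_line()` is a hypothetical helper, not the course's `compute_linear_regression()`.

```python
def fit_line(x_train, y_train):
    """Return (m, b) of the least squares line through the training points."""
    mean_x = sum(x_train) / len(x_train)
    mean_y = sum(y_train) / len(y_train)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_train, y_train))
    denominator = sum((x - mean_x) ** 2 for x in x_train)
    m = numerator / denominator
    b = mean_y - m * mean_x
    return m, b


# Hypothetical training set: noise-free points on y = 2x
x_train = [0, 1, 2, 3, 4]
y_train = [0, 2, 4, 6, 8]
m, b = fit_line(x_train, y_train)

# Test instance from the notes: x = 150 with an actual y of 300
x_test = 150
y_actual = 300
y_hat = m * x_test + b
error = y_hat - y_actual
print(f"m={m}, b={b}, y_hat={y_hat}, error={error}")
```

With noisy training data, the learned slope and intercept would differ slightly from 2 and 0, and the error would no longer be zero.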

Our goal is to "convert" $y = mx + b$ into a standard API. Starting with PA4, we will implement ML algorithms that follow this API.


### Common API Terms

* $X$: 2D feature matrix (e.g. table with features for columns) with $\vec y$ removed
* $\vec y$: 1D class vector (e.g. the attribute/feature that you want to predict)
    * Note: $X$ and $\vec y$ are parallel! Row $i$ of $X$ and element $i$ of $\vec y$ describe the same instance
* we build a model/algorithm on "training data"
* we evaluate a model/algorithm on "testing data"
* This means that we split $X$ and $\vec y$ into training and testing data
* $X_{train}$: Training data matrix
* $\vec y_{train}$: Training class vector
* $X_{test}$: Testing data matrix
* $\vec y_{test}$: Testing class vector
* Each algorithm will be implemented as a class
* Each class will have the same "public" API
* `fit(X_train, y_train) -> None`
    * "fits"  a model/trains an algorithm based on the provided training data
* `predict(X_test) -> y_predicted(list)`
    * makes predictions for each instance in `X_test`
* we can compare `y_predicted` and `y_test` to see how well this model did on this particular test set
    * for regression: MAE (mean absolute error: the average absolute difference between $y_{predicted}$ and $y_{test}$)
    * for classification: accuracy (# matches between $y_{predicted}$ and $y_{test}$ divided by the total number of predictions)
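The two evaluation metrics described above can be sketched in a few lines. The function names `mean_absolute_error()` and `accuracy()` and the sample values are assumptions for illustration. Both functions rely on `y_predicted` and `y_test` being parallel lists.

```python
def mean_absolute_error(y_predicted, y_test):
    """Regression metric: average of |predicted - actual| over the test set."""
    if len(y_predicted) != len(y_test):
        raise ValueError("y_predicted and y_test must be parallel")
    total = sum(abs(pred - actual) for pred, actual in zip(y_predicted, y_test))
    return total / len(y_test)


def accuracy(y_predicted, y_test):
    """Classification metric: fraction of predictions that match the actual labels."""
    if len(y_predicted) != len(y_test):
        raise ValueError("y_predicted and y_test must be parallel")
    matches = sum(1 for pred, actual in zip(y_predicted, y_test) if pred == actual)
    return matches / len(y_test)


# Hypothetical predictions and actual values
print(mean_absolute_error([290.0, 110.0], [300.0, 100.0]))  # 10.0
print(accuracy(["yes", "no", "yes", "no"], ["yes", "no", "no", "no"]))  # 0.75
```

A lower MAE is better, while a higher accuracy is better.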


TASK FOR NEXT CLASS: Check out `mysimplelinearregressor.py`
* This includes our work with `compute_linear_regression()` refactored with this API design
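For orientation before opening the file, here is a minimal sketch of what a class following the `fit()`/`predict()` API could look like. `SketchLinearRegressor` and its attribute names `slope` and `intercept` are hypothetical. This is not the contents of `mysimplelinearregressor.py`, which may be organized differently. The sketch assumes each row of `X` holds a single feature.

```python
class SketchLinearRegressor:
    """Hypothetical simple linear regressor with the fit()/predict() API."""

    def __init__(self):
        # Both stay None until fit() is called
        self.slope = None
        self.intercept = None

    def fit(self, X_train, y_train):
        """Learn slope and intercept by least squares. Returns None."""
        xs = [row[0] for row in X_train]  # single feature per instance (assumption)
        mean_x = sum(xs) / len(xs)
        mean_y = sum(y_train) / len(y_train)
        numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, y_train))
        denominator = sum((x - mean_x) ** 2 for x in xs)
        self.slope = numerator / denominator
        self.intercept = mean_y - self.slope * mean_x

    def predict(self, X_test):
        """Return a list with one prediction per instance in X_test."""
        if self.slope is None or self.intercept is None:
            raise ValueError("call fit() before predict()")
        return [self.slope * row[0] + self.intercept for row in X_test]


regressor = SketchLinearRegressor()
regressor.fit([[0], [1], [2], [3]], [1, 3, 5, 7])  # points on y = 2x + 1
print(regressor.predict([[150]]))
```

Keeping the learned parameters as attributes lets `predict()` be called many times after a single `fit()`.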

## Warm-Up Task 2/17/22 - Setting Up This Linear Regressor

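```python
import importlib

import numpy as np

import mysimplelinearregressor
# Reload so that edits to mysimplelinearregressor.py take effect without restarting the kernel
importlib.reload(mysimplelinearregressor)
from mysimplelinearregressor import MySimpleLinearRegressor

np.random.seed(0)  # fixed seed so the noisy training data is reproducible

# Training data: y = 2x plus Gaussian noise (mean 0, standard deviation 25)
X_train = [[val] for val in range(100)]  # 2D
y_train = [row[0] * 2 + np.random.normal(0, 25) for row in X_train]  # 1D

X_test = [[150]]  # 2D

lin_reg = MySimpleLinearRegressor()
lin_reg.fit(X_train, y_train)
y_predicted = lin_reg.predict(X_test)
print(y_predicted)
```

```
[293.94940496062173]
```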


Now let's take the example above and convert it into unit tests of `fit()` and `predict()`. You can check `test_mysimplelinearregressor.py` in this directory.

Some tips for writing test cases:

* Start with simple/common test cases
* Then move on to complex/edge test cases
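The tips above might translate into tests like the following sketch, written as pytest-style functions that use plain `assert` statements. `StandInRegressor` is a hypothetical stand-in so the example runs on its own. In the course repository you would import `MySimpleLinearRegressor` instead, and the actual tests in `test_mysimplelinearregressor.py` may differ.

```python
import math


class StandInRegressor:
    """Hypothetical stand-in exposing the same fit()/predict() API."""

    def fit(self, X_train, y_train):
        xs = [row[0] for row in X_train]
        mean_x, mean_y = sum(xs) / len(xs), sum(y_train) / len(y_train)
        self.slope = (
            sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, y_train))
            / sum((x - mean_x) ** 2 for x in xs)
        )
        self.intercept = mean_y - self.slope * mean_x

    def predict(self, X_test):
        return [self.slope * row[0] + self.intercept for row in X_test]


def test_simple_noise_free_line():
    # Simple/common case: training points lie exactly on y = 2x + 5
    X_train = [[val] for val in range(10)]
    y_train = [2 * row[0] + 5 for row in X_train]
    regressor = StandInRegressor()
    regressor.fit(X_train, y_train)
    y_predicted = regressor.predict([[150]])
    assert math.isclose(y_predicted[0], 305.0, rel_tol=1e-9)


def test_predict_returns_one_value_per_instance():
    # predict() should return a list parallel to X_test
    regressor = StandInRegressor()
    regressor.fit([[0], [1], [2]], [5, 7, 9])
    y_predicted = regressor.predict([[0], [1], [2]])
    assert len(y_predicted) == 3
    for predicted, expected in zip(y_predicted, [5.0, 7.0, 9.0]):
        assert math.isclose(predicted, expected, abs_tol=1e-9)


def test_edge_case_flat_line():
    # Edge case: constant y, so the slope is 0 and every prediction is that constant
    regressor = StandInRegressor()
    regressor.fit([[1], [2], [3]], [3, 3, 3])
    assert math.isclose(regressor.predict([[1000]])[0], 3.0, abs_tol=1e-9)


# Run directly; with pytest installed, `pytest` would discover these functions too
test_simple_noise_free_line()
test_predict_returns_one_value_per_instance()
test_edge_case_flat_line()
print("all tests passed")
```

Using a tolerance (`math.isclose`) rather than `==` avoids spurious failures from floating-point rounding.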