## SI 670 Applied Machine Learning, Week 4:  Multi-class Classification, SVM, Data Leakage (Due Wednesday 09/28/2022 11:59pm)

For this assignment, you will be practicing various machine learning operations in scikit-learn related to support vector machines, hyper-parameter search, data leakage, and missing-value imputation.

* This homework is worth 100 points in total. Correct answers and code receive full credit, but partial credit will be awarded if you have the right idea even if your final answers aren't quite right.

* Submit your completed notebook file to the Canvas site - **IMPORTANT**: please name your submitted file `si670f22-hw4-youruniqname.ipynb`

* Any file submitted after the deadline will be marked as late. Please consult the syllabus regarding late submission policies. You can submit the homework as many times as you want, but only your latest submission will be graded.

* As a reminder, the notebook code you submit must be your own work. Feel free to discuss general approaches to the homework with classmates. If you end up forming more of a team discussion on multiple questions, please include the names of the people you worked with at the top of your notebook file.


### Collaborators, if any:

### Question 1 (20 points)

Please write your answers to the following questions, together with your derivations. You can use either LaTeX or Python code to represent your answers. For example, if you want to present <$x_1^2$>, in the LaTeX format you should write <(dollar sign) x_1^2 (dollar sign)>; in the Python code format you should write <\`x_1\*\*2\`>. See [here](https://csrgxtu.github.io/2015/03/20/Writing-Mathematic-Fomulars-in-Markdown/) for how to represent more mathematical symbols in LaTeX format.

*Note: Question 1 does not require coding.*

<!-- #### (a) (10 points) 

If you have data with features $(x_1, x_2)$, what will be the set of the expanded features after you apply the `PolynomialFeatures` transformation with `degree=3` on it? The order of the features does not matter in your answer.

#### (b) (10 points) 
The main metric we have been using to measure the quality of regression models is $R^2$, which is defined as, for n data points, $R^2 = 1 -  \frac{\sum_{i=1}^n(\hat{y}_i - y_i)^2}{\sum_{i=1}^n(y_i - \bar{y})^2}$, where $y_i, \hat{y}_i$ are the label and prediction of data point i, and $\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i$. We denote $\frac{1}{n}\sum_{i=1}^n(\hat{y}_i - y_i)^2$ as *Unexplained Variation* and $\frac{1}{n}\sum_{i=1}^n(y_i - \bar{y})^2$ as *Total Variation*. 

Given 5 data points with labels (1, 3, 2, -4, 6) and two classifiers A and B, suppose the predictions of A are (1.1, 1.4, 1.3, -2, 2) and the predictions of B are (1.7, 1.3, 0.3, 2, 3). Please calculate and report the *Unexplained Variation*, *Total Variation*, and *$R^2$* for classifiers A and B respectively.  -->

#### (a) (5 points)
Suppose that $w = (3, 4)$ for an SVM decision boundary. What is the margin $M$?

#### (b) (10 points)
Suppose that $(3, 2)$ is a support vector for the SVM above, and its label is $+1$. What is the $b$ for the SVM decision boundary? (Hint: think about the positive plane)
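
A minimal numeric sketch for the two SVM questions above. It assumes the common convention in which the positive and negative planes are $w \cdot x + b = +1$ and $w \cdot x + b = -1$, so that the margin width is $M = 2 / \lVert w \rVert$. If your course notes define the margin as the distance from the decision boundary to one plane, halve that value. The variable names are illustrative only.

```python
import numpy as np

# Weight vector from the question.
w = np.array([3.0, 4.0])

# Assumed convention: margin width is the distance between the planes
# w.x + b = +1 and w.x + b = -1, which equals 2 / ||w||.
w_norm = np.linalg.norm(w)
margin_width = 2.0 / w_norm

# A support vector with label +1 lies on the positive plane: w.x + b = +1.
x_sv = np.array([3.0, 2.0])
b = 1.0 - w @ x_sv

print("||w|| =", w_norm)
print("margin width =", margin_width)
print("b =", b)
```

The same steps can be done by hand: compute $\lVert w \rVert$, divide, and then solve the positive-plane equation for $b$.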

#### (c) (5 points)
You are given 3 data points with two features $x_1$ and $x_2$ and one label $Y$ as follows,

| $x_1$ | $x_2$ | $Y$ |
|-------|-------|-----|
| 1     | 1     | +1  |
| 0.7   | 3     | -1  |
| 2     | 0.5   | +1  |

Suppose you have a linear classification model where $w = (0.5, 0)$. What is the hinge loss with L2 regularization for this model?
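
The computation can be sketched in NumPy as below. The question does not state a bias term, a regularization coefficient, or whether the hinge losses are summed or averaged, so this sketch assumes $b = 0$, a coefficient `lam = 1.0`, a summed hinge loss, and the penalty $\lVert w \rVert^2$. Adjust these to match the definition used in your lecture notes.

```python
import numpy as np

# Data from the table above.
X = np.array([[1.0, 1.0],
              [0.7, 3.0],
              [2.0, 0.5]])
y = np.array([1, -1, 1])

w = np.array([0.5, 0.0])
b = 0.0    # assumption: no bias term is given in the question
lam = 1.0  # assumption: regularization coefficient of 1

# Functional margins y_i * (w.x_i + b).
margins = y * (X @ w + b)

# Hinge loss per point: max(0, 1 - margin).
hinge = np.maximum(0.0, 1.0 - margins)

# L2 penalty on the weights (assumed form: squared Euclidean norm).
l2_penalty = w @ w

total = hinge.sum() + lam * l2_penalty
print("per-point hinge losses:", hinge)
print("L2 penalty:", l2_penalty)
print("total objective:", total)
```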

<!-- You are given 3 data points with two features $x_1$ and $x_2$ and one label $Y$ as follows,

|    X1	| X2 	| Y 	|
|----	|----	|----	|
|   1	|   1 	| 1.05 	|
|   0.7 |  3 	| 0.81 	|
|   2   |  0.5 	| 2.045 |

Suppose you have a linear regression model: $\hat{y} = w_1 x_1 + w_2 x_2$, please calculate and report the hinge loss, the L1 regularization, and the L2 regularization terms for each of the following linear models:

(i) $w_1 = 1, w_2 = 0$

(ii) $w_1 = 1, w_2 = 0.02$

If you set the regularization coefficient $\alpha=1$, which of the above two weights is preferred by the Lasso and which is preferred by Ridge regression? Could you use this example to explain why Lasso prefers sparse models? -->


#### Answer 1(a)

Write your answer to 1(a) here.



#### Answer 1(b)

Write your answer to 1(b) here.


#### Answer 1(c)

Write your answer to 1(c) here.


### Question 2 (40 points)

First use `MinMaxScaler` to scale the breast cancer data and then use `GridSearchCV` to search the `kernel`, `C`, and `gamma` parameters for `SVC`. Be careful about data leakage issues. Please return the best hyper-parameters found by cross-validation and the test score associated with these hyper-parameters.

Please search `kernel` over ('linear', 'rbf'), `C` over (0.1, 1, 10, 100), and `gamma` over (0.1, 1, 10, 100), and apply `random_state=0` in `train_test_split`.

*This function should return a tuple with four elements, i.e. `(best_kernel, best_C, best_gamma, test_score)`.*

In [13]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

svc = SVC()

# cancer dataset
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
# split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state=0, test_size=0.25)

svc = SVC().fit(X_train, y_train)

In [14]:
mm = MinMaxScaler().fit(X_train)
X_train_scaled = mm.transform(X_train)
X_test_scaled = mm.transform(X_test)

In [15]:
svc = SVC().fit(X_train_scaled, y_train)
svc.predict(X_test_scaled)

array([0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0,
       1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1,
       1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1,
       0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1,
       0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0,
       1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0])

In [16]:
svc = SVC().fit(X_train_scaled, y_train)
svc.score(X_test_scaled, y_test)

0.972027972027972

In [17]:
parameters = {'kernel':('linear', 'rbf'), 'C':[0.1, 1, 10, 100], 'gamma':[0.1, 1, 10, 100]}
clf = GridSearchCV(svc, parameters)

In [22]:
svc_clf = clf.fit(X_train_scaled, y_train)
svc_clf.score(X_train_scaled, y_train)

0.9835680751173709

In [60]:
# a =svc_clf.score(X_test_scaled, y_test)
# svc_clf.cv_results_['params']

In [38]:
svc_clf.cv_results_['params'][svc_clf.best_index_], svc_clf.score(X_test_scaled, y_test)

({'C': 1, 'gamma': 1, 'kernel': 'rbf'}, 0.972027972027972)

In [59]:
# svc_clf.cv_results_

In [20]:
clf.cv_results_.keys()

dict_keys(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 'param_C', 'param_gamma', 'param_kernel', 'params', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'split3_test_score', 'split4_test_score', 'mean_test_score', 'std_test_score', 'rank_test_score'])

In [2]:
svc = SVC().fit(X_train, y_train)
svc.predict(X_test)

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0,
       1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1,
       1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1,
       0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1,
       0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0,
       1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0])

In [58]:
def answer_two():
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import cross_val_score
    import numpy as np
    
    # cancer dataset
    (X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
    # split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state=0, test_size=0.2)
    
    # scaling: fit the scaler on the training data only, then apply it to the test data
    scaler = MinMaxScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    svc = SVC().fit(X_train_scaled, y_train)
    svc.score(X_test_scaled, y_test)
    
    parameters = {'kernel':('linear', 'rbf'), 'C':[0.1, 1, 10, 100], 'gamma':[0.1, 1, 10, 100]}
    clf = GridSearchCV(svc, parameters)
    
    svc_clf = clf.fit(X_train_scaled, y_train)
    
    # best_params_ holds the parameter setting with the best mean cross-validation score
    best_kernel = svc_clf.best_params_['kernel']
    best_C = svc_clf.best_params_['C']
    best_gamma = svc_clf.best_params_['gamma']
    test_score = svc_clf.score(X_test_scaled, y_test)

    return best_kernel, best_C, best_gamma, test_score

answer_two()

('rbf', 1, 1, 0.9736842105263158)
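
One way to guard against leakage inside the cross-validation folds themselves is to put the scaler and the classifier in a scikit-learn `Pipeline`, so that `MinMaxScaler` is refit on the training portion of every fold. The following is a minimal sketch under that assumption, not the required solution. The names `pipe`, `param_grid`, and `search` are illustrative, and the default `test_size` is used because the question does not fix one.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The scaler is a pipeline step, so each CV fold fits it on that fold's
# training portion only; the held-out portion never influences the scaling.
pipe = Pipeline([("scaler", MinMaxScaler()), ("svc", SVC())])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {
    "svc__kernel": ["linear", "rbf"],
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.1, 1, 10, 100],
}

search = GridSearchCV(pipe, param_grid)
search.fit(X_train, y_train)

best = search.best_params_
test_score = search.score(X_test, y_test)
print(best, test_score)
```

With this design the test set is touched only once, in the final `score` call, and the selected hyper-parameters may differ slightly from those found when scaling is done before the grid search.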

### Question 3 (40 points)

Suppose you have a dataset with some missing values and you know the values are not missing at random and the probability of missing is related to the values themselves. For example, people with higher
earnings may be less likely to reveal them. 

#### (a) (5 points) In this case, what would happen when imputing the missing values with the mean strategy?



#### Answer 3(a)

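Write your answer here.
If we impute with the mean, the missing (typically higher) values are replaced by the mean of the observed (typically lower) values, so we lose an important signal associated with people who earn more. This can severely affect a downstream classification model.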
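
The effect can be illustrated with a small simulation. Everything in it is synthetic and assumed for illustration: the log-normal "income" distribution, the rule that makes higher values more likely to be missing, and all variable names. None of it comes from the assignment data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic incomes (assumed log-normal, purely for illustration).
income = rng.lognormal(mean=10.5, sigma=0.6, size=10_000)

# Missing not at random: the chance of being missing grows with the value.
# ranks lies in [0, 1], with 1 for the highest income.
ranks = income.argsort().argsort() / (income.size - 1)
missing = rng.uniform(size=income.size) < 0.8 * ranks

observed = np.where(missing, np.nan, income)

true_mean = income.mean()
observed_mean = np.nanmean(observed)  # what mean imputation would fill in

# Mean imputation replaces every missing entry with the observed mean.
imputed = np.where(missing, observed_mean, income)

print("true mean:             ", true_mean)
print("mean of observed values:", observed_mean)
print("mean after imputation:  ", imputed.mean())
```

Because the high values are the ones that tend to be missing, the observed mean underestimates the true mean, and every imputed entry inherits that bias.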

In [39]:
# Please run this cell first before doing question 3(b)

import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

cancer = load_breast_cancer()
d = {}
for i in range(len(cancer.feature_names)):
    d[cancer.feature_names[i]] = cancer.data[:, i]
d['target'] = cancer.target
df = pd.DataFrame(d)


X = df[['mean concave points', 'worst concave points']]
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(np.array(X), np.array(y), random_state=0)
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

lr = LogisticRegression()
lr.fit(X_train, y_train)
print('The performance based on the full data: {:.3f}'.format(lr.score(X_test, y_test)))


rng = np.random.RandomState(0)

n_samples = X_train.shape[0]
X_train_missing = X_train.copy()
u = rng.uniform(low=0.3, high=1, size=(n_samples,))
X_train_missing[np.where(u < X_train[:, 0])[0], 0] = np.nan
u = rng.uniform(low=0.3, high=1, size=(n_samples,))
X_train_missing[np.where(u < X_train[:, 1])[0], 1] = np.nan

n_samples = X_test.shape[0]
X_test_missing = X_test.copy()
u = rng.uniform(low=0.3, high=1, size=(n_samples,))
X_test_missing[np.where(u < X_test[:, 0])[0], 0] = np.nan
u = rng.uniform(low=0.3, high=1, size=(n_samples,))
X_test_missing[np.where(u < X_test[:, 1])[0], 1] = np.nan


The performance based on the full data: 0.923


#### (b) (10 points) 

Please impute the missing values using `SimpleImputer` with `strategy='mean'`. Then fit a LogisticRegression with default hyper-parameters, and return the imputed data and the test score.


*This function should return a tuple of two arrays and one number: `(X_train_imputed, X_test_imputed, test_score)`.*


In [56]:
def answer_three_b():
    from sklearn.impute import SimpleImputer
    
    # YOUR CODE HERE.
    # Fit the imputer on the training data only, then apply it to both splits.
    imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
    X_train_imputed = imp_mean.fit_transform(X_train_missing)
    X_test_imputed = imp_mean.transform(X_test_missing)
    
    logreg = LogisticRegression().fit(X_train_imputed, y_train)
    test_score = logreg.score(X_test_imputed, y_test)
    print(test_score)
    
    return (X_train_imputed, X_test_imputed, test_score)

answer_three_b()

0.8461538461538461


#### (c) (10 points) 

Please impute the missing values using `SimpleImputer` with `strategy='mean'` and `add_indicator=True`. Then fit a LogisticRegression with default hyper-parameters, and return the imputed data and the test score.


*This function should return a tuple of two arrays and one number: `(X_train_imputed, X_test_imputed, test_score)`.*

In [57]:
def answer_three_c():
    from sklearn.impute import SimpleImputer

    # Mean imputation plus a binary "was missing" indicator column for each feature with missing values.
    imp_mean_ind = SimpleImputer(missing_values=np.nan, strategy='mean', add_indicator=True)
    X_train_imputed_ind = imp_mean_ind.fit_transform(X_train_missing)
    X_test_imputed_ind = imp_mean_ind.transform(X_test_missing)
    
    logreg = LogisticRegression().fit(X_train_imputed_ind, y_train)
    test_score = logreg.score(X_test_imputed_ind, y_test)
    print(test_score)

    return (X_train_imputed_ind, X_test_imputed_ind, test_score)

answer_three_c()

0.9090909090909091


#### (d) (5 points) 

Why is adding the indicator helpful when the missing values are "missing not at random"?

#### Answer 3(d)

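Adding the indicator is helpful here because the values are missing systematically rather than by chance: whether a value is missing depends on the value itself. The missingness pattern therefore carries information about the hidden value, and the indicator column exposes it to the model as an extra feature, which improves the logistic regression's predictions.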
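
To make the extra columns concrete, here is a small sketch of what `SimpleImputer` returns with `add_indicator=True`. The toy array `X_toy` is made up for illustration and is unrelated to the assignment data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data: one missing entry in each of the two features.
X_toy = np.array([[0.1, 0.2],
                  [np.nan, 0.4],
                  [0.3, np.nan],
                  [0.5, 0.6]])

imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_out = imputer.fit_transform(X_toy)

# The first two columns are the mean-imputed features; the remaining columns
# are 0/1 flags marking which entries were originally missing.
print(X_out.shape)
print(X_out)
```

A linear model can then learn a separate weight for each flag, which lets it treat "value was missing" differently from "value equals the mean".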
