# HW 7

This assignment covers several aspects of Regularizations & Tree type Classifier. 
**DO NOT ERASE MARKDOWN CELLS AND INSTRUCTIONS IN YOUR HW submission**

  * **Q** - QUESTION
  * **A** - Where to input your answer

## Instructions

Keep the following in mind for all notebooks you develop:
* Structure your notebook. 
* Use headings with meaningful levels in Markdown cells, and explain the questions each piece of code is to answer or the reason it is there.
* Make sure your notebook can always be rerun from top to bottom.
* Please start working on this assignment as soon as possible. If you are a beginner in Python this might take a long time. One of the objectives of this assignment is to help you learn python and scikit-learn package. 
* See [README.md](README.md) for homework submission instructions

## Related Tutorials
    
* [Lasso Regression - L1 Regularization](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html)

* [Ridge Regression - L2 Regularization](https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html)

* [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)

* [Metrics : Precision-Recall curve](https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html)

# Data Processing

**Data** 
* Get the exploratory data and the folowing files from [link](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29)
* Save metadata and the original data from download [Link](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29) to your local HW folder. 
* If you are using command line, the commands are:  
```
>> wget https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data
>> wget https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names
``` 
* wget instructions: 
  * dowload it from [link](https://eternallybored.org/misc/wget/) 
  * follow [steps](https://stackoverflow.com/questions/29113456/wget-not-recognized-as-internal-or-external-command)

**Q1** Get training data from the dataframe
1. Load breast-cancer-wisconsin.data into data frame
2. Note: the data file does not contains column names, so include appropriate column names by exploring the metadata file
3. Replace Non-Numeric values with 0
4. Replace Class label ```2 with 0``` and ```4 with 1```
4. Assign values of ```Class``` column to ```y```, note you have to use ```.values``` method
5. Drop ```Class``` column from data frame,
6. Assign df values to x
7. Split dataset into train and test data use train_test_split with test_size = 0.2, stratify y and random_state = 1238

**A1** Replace ??? with code in the code cell below

In [25]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

#Read the breast-cancer-wisconsin.data file using the prropriate separator as input to read_csv()
columns=['Sample Code Number', 'Clump Thickness', 'Uniformity of Cell Size', 
         'Uniformity of Cell Shape', 'Marginal Adhesion', 'Single Epithelial Cell Size'
         'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'Class']
df = pd.read_csv("breast-cancer-wisconsin.data",names=columns, sep=',')

#print the head
print(df.head())


# Replace Non numeric values with 0 
#df = df.replace(np.nan, value=0)
df = df.replace('?', value=0)

#Replace the Class label values following above given instructions
df['Class'].replace(2, 0)
df['Class'].replace(4, 1)

         Sample Code Number  Clump Thickness  ...  Mitoses  Class
1000025                   5                1  ...        1      2
1002945                   5                4  ...        1      2
1015425                   3                1  ...        1      2
1016277                   6                8  ...        1      2
1017023                   4                1  ...        1      2

[5 rows x 10 columns]


1000025    2
1002945    2
1015425    2
1016277    2
1017023    2
          ..
776715     2
841769     2
888820     1
897471     1
897471     1
Name: Class, Length: 699, dtype: int64

In [26]:
# Assign values of ```Class``` column to y, note you have to use .values method
y = df.Class.values
# Drop 'Class' column from data frame,
df.drop(columns=['Class'], inplace=True)
# Assign df values to x
x = df
# View shape of x and y
print(x.shape)
print(y.shape)

xtrain, xtest, ytrain, ytest = train_test_split(x,y, test_size = 0.2, random_state=1238, stratify=y)

(699, 9)
(699,)


# Model Regularization

## Ridge Regularization/ Ridge Regression

**Q2** Train Ridge Regularization Model
1. Create a Ridge Regularization Model using sklearn library, ```(See the documenttaion for details)```
2. Fit the model with the train data
3. Predict the values with the test data
4. Print the coefficients of the model
5. Calculate the test MSE 
6. Get the score from the model using test data
7. Plot Precision-Recall Curve from the true & predicted test data

**Note**
* Here we generate an array of cost values ranging from very big to very small 
  * cost here is variable alpha: alpha is equivalent to lambda in the lesson 13. 
 ![RidgeRegression](../figures/RidgeRegression.jpg) 
* Associated with each alpha value is a vector of ridge regression coefficients that we store in a matrix, with 100 rows (one for each value of alpha) and 10 columns (one for each predictor).  

**A2** Replace ??? with code in the code cell below

In [27]:
from sklearn.preprocessing import scale 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge, Lasso, RidgeCV, LassoCV
from sklearn.metrics import mean_squared_error


alphas = 10**np.linspace(10,-2,100)*0.5
ridge = Ridge(normalize = True)
coefs = []

for a in alphas:
    ridge.set_params(alpha = a)
    ridge.fit(x,y)
    coefs.append(ridge.coef_)
    
np.shape(coefs) 

pred1 = ridge.predict(xtest)      
print(pd.Series(ridge.coef_, index = df.columns[0:11])) # Print coefficients
mse = mean_squared_error(ytest, pred1)         # Calculate the test MSE
print("Test mean squared error (MSE): {:.2f}".format(mse))

# print score
print(ridge.score)

Sample Code Number                        0.063993
Clump Thickness                           0.043881
Uniformity of Cell Size                   0.034681
Uniformity of Cell Shape                  0.011738
Marginal Adhesion                         0.014764
Single Epithelial Cell SizeBare Nuclei    0.091024
Bland Chromatin                           0.041821
Normal Nucleoli                           0.035267
Mitoses                                   0.004837
dtype: float64
Test mean squared error (MSE): 0.13
<bound method RegressorMixin.score of Ridge(alpha=0.005, copy_X=True, fit_intercept=True, max_iter=None,
      normalize=True, random_state=None, solver='auto', tol=0.001)>


### Precision-Recall Curve for Ridge1

In [28]:
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import plot_confusion_matrix
from sklearn.metrics import confusion_matrix
from mlxtend.plotting import plot_confusion_matrix
import matplotlib.pyplot as plt

#pass necessary parameters to precision_recall_curve method
precision, recall, thresholds = precision_recall_curve(ytest, pred1)

# find threshold closest to zero
close_zero = np.argmin(np.abs(thresholds))

plt.plot(precision[close_zero], recall[close_zero], 'o', markersize=10,
         label="threshold zero", fillstyle="none", c='k', mew=2)

plt.plot(precision, recall, label="precision recall curve")
plt.xlabel("Precision")
plt.ylabel("Recall")
plt.legend(loc="best")


ValueError: ignored

**Q3** Train Ridge Regression Model on the training set, and evaluate
1. Now, Create a Ridge Regression passing ```alpha = 4, normalize = True to Ridge()```
2. Fit the model with the train data
3. Predict the values with the test data
4. Print the coefficients of the model
5. Calculate the test MSE 
6. Get the score from the model using test data
7. Plot Precision-Recall Curve from the true & predicted test data

**A3** Replace ??? with code in the code cell below

In [None]:
ridge2 = Ridge(alpha = 4, normalize = True)
ridge2.fit(xtrain, ytrain)
pred2 = ridge2.predict(xtest)


#print model coefficients      
print(pd.Series(ridge2.coef_, index = df.columns[0:11])) 

mse = mean_squared_error(ytest, pred2)       
print("Test mean squared error (MSE): {:.2f}".format(mse))

#print score
print(ridge2.score)

In [None]:
#pass necessary parameters to precision_recall_curve method  
precision, recall, thresholds = precision_recall_curve(ytest, pred3)

# find threshold closest to zero
close_zero = np.argmin(np.abs(thresholds))

plt.plot(precision[close_zero], recall[close_zero], 'o', markersize=10,
         label="threshold zero", fillstyle="none", c='k', mew=2)

plt.plot(precision, recall, label="precision recall curve")
plt.xlabel("Precision")
plt.ylabel("Recall")
plt.legend(loc="best")

**Q4** Train Ridge Regression Model on the training set, and evaluate
1. Now, Create a Ridge Regression passing ```alpha = 100, normalize = True to Ridge()```
2. Fit the model with the train data
3. Predict the values with the test data
4. Print the coefficients of the model
5. Calculate the test MSE 
6. Get the score from the model using test data
7. Plot Precision-Recall Curve from the true & predicted test data

**A4** Replace ??? with code in the code cell below

In [None]:
ridge3 = Ridge(alpha = 100, normalize = True)
ridge3.fit(xtrain, ytrain)             
pred3 = ridge3.predict(xtest)  


#print model coefficients      
print(pd.Series(ridge3.coef_, index = df.columns[0:11])) 

mse = mean_squared_error(ytest, pred3)       
print("Test mean squared error (MSE): {:.2f}".format(mse))

#print score
print(ridge3.score)

In [None]:
#pass necessary parameters to precision_recall_curve method  
precision, recall, thresholds = precision_recall_curve(ytest, pred1)


# find threshold closest to zero
close_zero = np.argmin(np.abs(thresholds))

plt.plot(precision[close_zero], recall[close_zero], 'o', markersize=10,
         label="threshold zero", fillstyle="none", c='k', mew=2)

plt.plot(precision, recall, label="precision recall curve")
plt.xlabel("Precision")
plt.ylabel("Recall")
plt.legend(loc="best")

**Q5** Train Ridge Regression Model on the training set, and evaluate
1. Now, Create a Ridge Regression passing ```alpha = 0, normalize = True to Ridge()```
2. Fit the model with the train data
3. Predict the values with the test data
4. Print the coefficients of the model
5. Calculate the test MSE 
6. Get the score from the model using test data
7. Plot Precision-Recall Curve from the true & predicted test data

**A5** Replace ??? with code in the code cell below

In [None]:
ridge4 = Ridge(alpha = 0, normalize = True to Ridge())
ridge4.fit(xtrain, ytrain)             
pred4 = ridge4.predict(xtest)


#print model coefficients      
print(pd.Series(ridge4.coef_, index = df.columns[0:11])) 

mse = mean_squared_error(ytest, pred3)  
print("Test mean squared error (MSE): {:.2f}".format(mse))

#print score
print(ridge4.score)

In [None]:
#pass necessary parameters to precision_recall_curve method  
precision, recall, thresholds = precision_recall_curve(ytest, pred4)

# find threshold closest to zero
close_zero = np.argmin(np.abs(thresholds))

plt.plot(precision[close_zero], recall[close_zero], 'o', markersize=10,
         label="threshold zero", fillstyle="none", c='k', mew=2)

plt.plot(precision, recall, label="precision recall curve")
plt.xlabel("Precision")
plt.ylabel("Recall")
plt.legend(loc="best")

**Q6** **Study the codes above and answer the following questions:***

1. Why when passing alpha = 100, the coefficients become very small? 
2. Does alpha = 4 improve the MSE comparing to regular least squares?
3. How the size of alphas affects MSE and the Score of the models? 


**A6** Your answers:

1. Couldn't get it to work. Gonna be honest.

2.

3.


# Lasso Regularization

**Q7 Create a Lasso Regression passing ```max_iter = 10000, normalize = True to Lasso()```**

1. se the alphas from the 2nd question for setting parameters in Lasso
2. Fit the model with the train data
3. Predict the values with the test data
4. Print the coefficients of the model
5. Calculate the test MSE 
6. Get the score from the model using test data
7. Plot Precision-Recall Curve from the true & predicted test data


**A7** 

Replace ??? with code in the code cell below


In [None]:
#Lasso regression
lasso =Lasso(max_iter = 10000, normalize = True)
coefs = []

for a in alphas:
    lasso.set_params(alpha=a)
    lasso.fit(xtrain, ytrain)
    coefs.append(lasso.coef_)

np.shape(coefs) 

pred1_lasso = lasso.predict(xtest)       
print(pd.Series(lasso.coef_, index = df.columns[0:11]))           # Print coefficients
mse = mean_squared_error(ytest, pred1_lasso)            

print("Test mean squared error (MSE): {:.2f}".format(mse))

#Print the score
print(lasso.score)

In [None]:
#pass necessary parameters to precision_recall_curve method  
precision, recall, thresholds = precision_recall_curve(ytest, pred1_lasso)

# find threshold closest to zero
close_zero = np.argmin(np.abs(thresholds))

plt.plot(precision[close_zero], recall[close_zero], 'o', markersize=10,
         label="threshold zero", fillstyle="none", c='k', mew=2)

plt.plot(precision, recall, label="precision recall curve")
plt.xlabel("Precision")
plt.ylabel("Recall")
plt.legend(loc="best")


**Q8 Observe the lasso.coef and ridge.coef, and discuss the findings below**


**A8**  Your answer goes here: 