## SP23:CS 477/577: Python for Machine Learning

### HW2: Use logistic regression for classification


Student Name: Martin, Taylor

### 1. Preliminary

#### 1.1 Problem formulation
1. Training set. Data samples: X_train = {$x_1, x_2, ..., x_i, ..., x_m$}. Target (class labels): Y_train = {$y_1, y_2, ..., y_i, ..., y_m$}
    
2. Predicted class probability in logistic regression:

\begin{equation*} 
    \hat{y_i} = \frac{1}{1+\exp\left(-(w_0 + w_1 x_{i,1} + \cdots + w_n x_{i,n})\right)}
\end{equation*}    
    where $w_0, w_1,...,w_n$ are the linear model parameters, and $x_{i,1}, ..., x_{i,n}$ are the features of the $i$th feature vector $x_i$.

3. Cost/loss function of the logistic regression: the criterion to quantitatively evaluate how good the current model is (the less the better).
\begin{equation*}
    J(w) = P(w) - C \sum_{i=1}^m \left[ y_i \log(\hat{y_i}) + (1-y_i) \log(1-\hat{y_i}) \right]
\end{equation*}
    - In the above equation, $\hat{y_i}$ and $y_i$ are the predicted class probability and the true class label of data sample $x_i$, respectively.
    - P is the penalty/regularizer, and C is the inverse regularization strength: C weights the data term, so a smaller C gives P more influence. There are three typical options for P: L1 norm ($\|w\|_1$), L2 norm ($\frac{1}{2}w^T w$), and elastic-net ($\frac{1 - \rho}{2}w^T w + \rho \|w\|_1$).
    
4. Model training is to use an 'optimization algorithm' to find the best $w$ that can minimize the loss function $J(w)$. Refer to the gradient descent algorithm in the Other Learning materials for more information.
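
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation: it assumes the L2 penalty option is added to the loss, leaves the intercept $w_0$ unpenalized, and uses plain batch gradient descent with a hand-picked step size. The helper names (`sigmoid`, `predict_prob`, `cost`, `gradient_descent`) and the toy data are made up for this example.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_prob(w0, w, X):
    """Predicted probability y_hat_i for each row x_i of X."""
    return sigmoid(w0 + X @ w)

def cost(w0, w, X, y, C=1.0, eps=1e-12):
    """J(w) with the L2 penalty: 0.5*w.w - C*sum(y*log(y_hat) + (1-y)*log(1-y_hat))."""
    y_hat = np.clip(predict_prob(w0, w, X), eps, 1 - eps)  # avoid log(0)
    log_lik = np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return 0.5 * w @ w - C * log_lik

def gradient_descent(X, y, C=1.0, lr=0.01, n_steps=2000):
    """Batch gradient descent on J(w); the intercept w0 is not penalized."""
    w0, w = 0.0, np.zeros(X.shape[1])
    for _ in range(n_steps):
        err = predict_prob(w0, w, X) - y       # y_hat_i - y_i for every sample
        w0 -= lr * C * err.sum()               # dJ/dw0 = C * sum(err)
        w -= lr * (w + C * (X.T @ err))        # dJ/dw  = w + C * X^T err
    return w0, w

# Toy, made-up data: one feature, classes separated around x = 0.
X_toy = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

w0_fit, w_fit = gradient_descent(X_toy, y_toy)
print("w0:", w0_fit, "w:", w_fit)
print("y_hat:", predict_prob(w0_fit, w_fit, X_toy))
print("J at start:", cost(0.0, np.zeros(1), X_toy, y_toy), "J after training:", cost(w0_fit, w_fit, X_toy, y_toy))
```

The loss after training should be lower than at the all-zero starting point, which is the sense in which training "finds the best $w$".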

#### 1.2. The sklearn.linear_model.LogisticRegression class

1. Class introduction: key parameters (e.g., C and penalty) and methods (fit, predict, predict_proba, and score)
    - https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#
    
2. source code
    - https://github.com/scikit-learn/scikit-learn/blob/b194674c4/sklearn/linear_model/_logistic.py#L1191
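
As a quick orientation, the sketch below exercises the parameters and methods named above on a tiny made-up data set. The variable names (`X_demo`, `y_demo`, `demo`) are placeholders for this example only; `C=1.0` and `penalty="l2"` are the library defaults written out explicitly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up 1-D toy data, for illustration only.
X_demo = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

demo = LogisticRegression(C=1.0, penalty="l2")   # the defaults, written out
demo.fit(X_demo, y_demo)

print("classes_:", demo.classes_)                # column order of predict_proba
print("intercept_ (w0):", demo.intercept_)
print("coef_ (w1..wn):", demo.coef_)
print("predict:", demo.predict(X_demo))          # hard class labels
print("predict_proba:", demo.predict_proba(X_demo))  # one column per class
print("score (accuracy):", demo.score(X_demo, y_demo))
```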

### 2. Breast Cancer Detection

Develop a logistic regression approach to classify breast masses as malignant (cancer) or benign.

Dataset

    :Number of data samples: 569
    :Number of features: 30 numeric. The first 10 features are the mean values of each measurement over all nuclei in an image
    :Class labels (We changed the original class labels by using 1 - y.)
        : Malignant: 1 (malignant or cancer)
        : Benign: 0  (benign or non-cancer)  
    https://scikit-learn.org/stable/datasets/index.html#breast-cancer-dataset
    
    

#### 2.1 Prepare dataset

In [160]:
import sklearn.datasets as ds
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np

# 1) data preparation
X,y = ds.load_breast_cancer(return_X_y=True)
y = 1-y
print("Dataset:", X.shape, y.shape)

rs = 0
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = rs)
print('training samples', X_train.shape, 'test samples', X_test.shape)
print("y_train:", y_train)

Dataset: (569, 30) (569,)
training samples (455, 30) test samples (114, 30)
y_train: [0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 1 0 0 1 0 1 0 0 0
 0 0 0 0 1 0 1 0 1 1 0 0 1 0 1 1 1 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 1 0 0 0 0
 1 0 0 1 1 0 0 1 1 0 0 1 0 0 1 1 1 0 0 0 1 0 0 0 0 0 1 0 1 0 1 0 1 0 1 0 0
 0 0 1 0 1 0 0 0 1 0 0 0 1 0 0 1 1 0 1 0 1 0 1 1 1 1 0 1 0 1 0 1 0 1 0 0 1
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 1 0 1 1 0 1 1 0
 0 1 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 1 0 0 1 0 0 1 0 1 0 0 0 0 1
 1 1 1 0 1 0 1 1 0 0 0 0 0 1 0 0 1 0 0 1 1 0 0 0 1 1 0 0 1 0 0 0 1 0 1 0 1
 1 1 1 0 0 0 0 1 1 0 0 0 0 0 1 0 0 1 0 0 1 1 1 1 0 0 1 0 0 0 1 1 0 0 0 0 0
 1 1 1 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 1 0 1 1 0 0 0 0 0 1 1 1 0 0 1
 1 0 0 1 0 1 1 0 1 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 1 0 1 1 1 0 1 0 1 0 1
 1 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 1 1 0 1 0 0 1
 1 1 1 0 1 1 1 0 1 0 1 0 0 1 1 0 1 0 0 0 0 1 0 0 1 0 0 0 1 1 0 0 0 1 0 0 1
 0 0 0 0 0 1 1 1
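
Before training, it can help to check how many samples fall in each class. The sketch below counts the labels after the `1 - y` flip and shows one optional variation that is not used in this notebook: passing `stratify` to `train_test_split` so that both parts keep roughly the same malignant fraction. The names `X_all`, `y_all`, `X_tr`, and so on are placeholders for this example.

```python
import numpy as np
import sklearn.datasets as ds
from sklearn.model_selection import train_test_split

X_all, y_all = ds.load_breast_cancer(return_X_y=True)
y_all = 1 - y_all                      # 1 = malignant, 0 = benign, as above

print("class counts [benign, malignant]:", np.bincount(y_all))

# Optional variation (an assumption, not part of the assignment):
# stratify=y_all keeps the malignant fraction nearly equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.20, random_state=0, stratify=y_all)
print("malignant fraction (train, test):", y_tr.mean(), y_te.mean())
```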

#### 2.2 Train and explore a logistic regression model. 25 points

In [161]:
#1) train a logistic regression model. 5 points

# penalty=None disables regularization (scikit-learn >= 1.2; older versions use penalty='none')
clf = LogisticRegression(penalty=None, random_state = 0, max_iter=1000, verbose=2)
clf.fit(X_train, y_train)

#2) print out the optimal model(linear) parameters after the training. 5 points
print("w0:", clf.intercept_)            #w0
print("w1...n:", clf.coef_)             #w1...n

#3) calculate and print out the predicted class probabilities of 
#the first 5 training samples by implementing the equation in 1.1 (y_i^hat). 10 points

z = clf.intercept_ + X_train[:5] @ clf.coef_.ravel()   # w0 + w1*x_i1 + ... + wn*x_in
y_hat = 1 / (1 + np.exp(-z))                           # P(y = 1 | x_i)

# y_hat should match the second column returned by predict_proba in step 4
print("y_hat:", y_hat)

#4) calculate and print out the predicted class probabilities of 
#the first 5 training samples using the predict_proba() function. 5 points

clf.predict_proba(X_train[:5])

RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =           31     M =           10
 This problem is unconstrained.

At X0         0 variables are exactly at the bounds


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.



At iterate    0    f=  3.15382D+02    |proj g|=  3.77194D+04

At iterate    1    f=  3.10664D+02    |proj g|=  1.15372D+04

At iterate    2    f=  3.06501D+02    |proj g|=  1.02634D+04

At iterate    3    f=  2.80457D+02    |proj g|=  4.26347D+04

At iterate    4    f=  2.53969D+02    |proj g|=  5.13566D+04

At iterate    5    f=  2.18351D+02    |proj g|=  4.45593D+04

At iterate    6    f=  1.83743D+02    |proj g|=  2.98168D+04

At iterate    7    f=  1.58099D+02    |proj g|=  1.80991D+04

At iterate    8    f=  1.41681D+02    |proj g|=  9.94303D+03

At iterate    9    f=  1.32666D+02    |proj g|=  4.51734D+03

At iterate   10    f=  1.28764D+02    |proj g|=  1.83151D+03

At iterate   11    f=  1.27171D+02    |proj g|=  1.63943D+03

At iterate   12    f=  1.26023D+02    |proj g|=  2.97649D+03

At iterate   13    f=  1.23978D+02    |proj g|=  3.21946D+03

At iterate   14    f=  1.23674D+02    |proj g|=  7.40594D+03

At iterate   15    f=  1.19406D+02    |proj g|=  6.24908D+03

At iter

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.2s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.2s finished


array([[9.99364221e-01, 6.35778942e-04],
       [9.49038689e-01, 5.09613106e-02],
       [9.75571446e-01, 2.44285535e-02],
       [9.99977918e-01, 2.20819985e-05],
       [9.51238089e-01, 4.87619107e-02]])
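
The solver output above stops at the iteration limit and suggests scaling the data. One possible way to follow that suggestion is to standardize the features inside a pipeline, sketched below. This is an illustration under assumptions rather than a replacement for the cell above: it uses the default L2-penalized model instead of the unpenalized one, and the names `scaled_clf` and `lr_step` are placeholders.

```python
import sklearn.datasets as ds
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = ds.load_breast_cancer(return_X_y=True)
y_all = 1 - y_all
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.20, random_state=0)

# StandardScaler is fit on the training data only, then applied to the test data.
# Assumption: default L2 penalty (section 2.2 trains without a penalty).
scaled_clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_clf.fit(X_tr, y_tr)

lr_step = scaled_clf.named_steps["logisticregression"]
print("iterations used:", lr_step.n_iter_)
print("test accuracy:", scaled_clf.score(X_te, y_te))
```

If the reported iteration count is below `max_iter`, the solver converged before hitting the limit.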

#### 2.3 Evaluation 1.0. 25 points

1) Accuracy. 5 points

In [162]:
# calculate and print out the accuracy of the trained model on the training set
# using the score function

training_accuracy = clf.score(X_train, y_train)
print("training_accuracy:", training_accuracy)

# calculate and print out the accuracy of the trained model on the test set
# using the score function
testing_accuracy = clf.score(X_test, y_test)
print("testing_accuracy:", testing_accuracy)

training_accuracy: 0.9758241758241758
testing_accuracy: 0.9649122807017544


2) Confusion matrix. 5 points

In [163]:
from sklearn.metrics import confusion_matrix
#print out the confusion matrix (cfm) for the trained model on the test set

y_prediction_X_test = clf.predict(X_test)
cfm = confusion_matrix(y_test, y_prediction_X_test)
print("cfm:", cfm)
#

cfm: [[64  3]
 [ 1 46]]


3) What is your observation from the cfm? 5 points

Response (add your response here): 
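   1) The rows of the cfm are the true classes and the columns are the predicted classes. The entries are counts rather than proportions, so each row sums to the number of test samples in that class (67 benign, 47 malignant) instead of 1. The test set is therefore mildly imbalanced toward benign cases.
   
   2) The model misclassifies 4 of the 114 test samples: 3 benign cases are predicted malignant (false positives) and 1 malignant case is predicted benign (false negative). Because the classes are imbalanced, accuracy alone can make the model appear more or less accurate on the malignant class than it actually is, so the per-class errors should be examined as well.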
   
   ...
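
To read the cfm as per-class rates, each row can be divided by its row sum. The sketch below does this with NumPy on the counts printed above; the names `cfm_counts` and `cfm_rates` are placeholders. `confusion_matrix` also accepts `normalize='true'`, which should give the same row-normalized matrix directly.

```python
import numpy as np

cfm_counts = np.array([[64, 3],
                       [1, 46]])    # counts printed above: rows = true class

# Row sums are the number of test samples in each true class.
print("samples per true class:", cfm_counts.sum(axis=1))

# Divide each row by its sum so that every row totals 1.
cfm_rates = cfm_counts / cfm_counts.sum(axis=1, keepdims=True)
print(cfm_rates)
```

In the normalized matrix, the bottom-right entry is the fraction of malignant cases that were detected.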

4) Recall and Precision. 10 points

In [164]:
#print out the recall ratio of the trained model for 
#detecting breast cancer using the above cfm
print("recall ratio:", cfm[1][1]/(cfm[1][1]+cfm[1][0]))

# print out the precision of the trained model for 
#detecting breast cancer using the above cfm
print("precision:", cfm[1][1]/(cfm[1][1]+cfm[0][1]))

recall ratio: 0.9787234042553191
precision: 0.9387755102040817
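
The hand-computed values can be cross-checked against `recall_score` and `precision_score` from `sklearn.metrics`. The sketch below rebuilds label vectors that reproduce the counts in the cfm above (64 TN, 3 FP, 1 FN, 46 TP); `y_true_demo` and `y_pred_demo` are stand-ins for `y_test` and `y_prediction_X_test`, not the actual arrays.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Label vectors that reproduce the counts above: 64 TN, 3 FP, 1 FN, 46 TP.
y_true_demo = np.array([0] * 64 + [0] * 3 + [1] * 1 + [1] * 46)
y_pred_demo = np.array([0] * 64 + [1] * 3 + [0] * 1 + [1] * 46)

# Both functions treat label 1 (malignant) as the positive class by default.
recall_demo = recall_score(y_true_demo, y_pred_demo)        # TP / (TP + FN)
precision_demo = precision_score(y_true_demo, y_pred_demo)  # TP / (TP + FP)
print("recall:", recall_demo)
print("precision:", precision_demo)
```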


####  2.4 Evaluation 2.0. 30 points

K-fold cross validation. (See Lecture Notes 10)

1) Why do we need K-fold cross validation in this application? 5 points

    Response (add your response here):
    
    We need K-fold cross validation because it ensures that every sample in the data being evaluated is used for both training and validation: each sample appears in the validation fold exactly once and in the training folds K-1 times. This matters here because the data set is small (569 samples), so the score from a single train/test split depends heavily on which samples happen to land in the test set. Training and evaluating K models on different subsets of the breast cancer data and averaging their accuracy, error rate, precision, and recall gives a less skewed, less split-dependent estimate of how the model will perform on new cases. Cross validation does not train the model further; it makes the evaluation more reliable.

    ...
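
The fold structure can be made concrete with a small sketch. It uses `StratifiedKFold`, which is what `cross_validate` is documented to use for classifiers when `cv` is an integer; shuffling, the made-up labels, and the name `times_validated` are assumptions for this illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y_demo = np.array([0] * 12 + [1] * 8)       # made-up labels, 20 samples
X_demo = np.arange(20).reshape(-1, 1)       # placeholder features

times_validated = np.zeros(len(y_demo), dtype=int)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(skf.split(X_demo, y_demo)):
    times_validated[val_idx] += 1           # count validation appearances
    print(f"fold {fold}: train={len(train_idx)} validate={len(val_idx)}")

print("times each sample was validated:", times_validated)
```

Every sample ends up in a validation fold exactly once, so all of the data contributes to the averaged score.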

2) Use the 5-fold cross validation to re-evaluate a logistic regression model. 15 points

In [165]:
#cross_validate in sklearn should be used in this task
from sklearn.model_selection import cross_validate

myLR_1 = LogisticRegression(C = 5, penalty = 'l1', solver = 'saga', random_state = rs)

#call the cross_validate function and set scoring = ('accuracy', 'recall')

scores = cross_validate(myLR_1, X_train, y_train, cv=5, scoring = ('accuracy', 'recall'))
#print out the test accuracy for each fold and print out the mean accuracy 
print("Cross Validation Scores:", scores)
mean_accuracy = scores['test_accuracy'].mean()
print("Mean Accuracy:", mean_accuracy)


#print out the test recall ratios for each fold and print out the mean recall ratio 
mean_recall = scores['test_recall'].mean()
print("Mean Recall:", mean_recall)



Cross Validation Scores: {'fit_time': array([0.0829823 , 0.10404181, 0.06411576, 0.04357791, 0.03432107]), 'score_time': array([0.00606537, 0.00308204, 0.00722957, 0.00536847, 0.00332713]), 'test_accuracy': array([0.94505495, 0.87912088, 0.91208791, 0.93406593, 0.89010989]), 'test_recall': array([0.87878788, 0.72727273, 0.78787879, 0.81818182, 0.75757576])}
Mean Accuracy: 0.9120879120879121
Mean Recall: 0.793939393939394




3) What are the differences between Evaluation 1.0 and Evaluation 2.0? Which one is better for this application, and why? 10 points

Response (Add your responses here):

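(1) Evaluation 1.0 scores the model on a single 80/20 hold-out split, so its accuracy, confusion matrix, recall, and precision all come from one set of 114 test samples and depend on which samples the random split placed there. Evaluation 2.0 trains and validates the model five times on different folds of the training set and reports the per-fold scores and their mean. The two evaluations also use different models (no penalty in 2.2; L1 penalty with C = 5 and the saga solver in 2.4), so the 0.965 test accuracy and the 0.912 mean cross-validation accuracy are not directly comparable.

(2) Evaluation 2.0 is better for this application. The data set is small, and averaging over five folds gives a less skewed, less split-dependent estimate of performance. It also shows how much the scores vary between folds (recall ranges from 0.73 to 0.88). Cross validation does not make the model itself more accurate; it makes the estimate of its accuracy and recall more trustworthy.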

...




#### 2.5 How can we improve the recall ratio?  20 points.

Achieving a high recall ratio (RR) is very important in cancer detection, because RR indicates a model's ability to identify cancers, and a low recall ratio may cost the lives of patients. Can you help improve the logistic regression approach to increase the recall ratio?

In [166]:
#train a logistic regression model.
myLR_2 = LogisticRegression(C = 5, penalty = 'l1', solver = 'saga', random_state = rs)
myLR_2.fit(X_train, y_train)



1) Print out the predicted probabilities of all misclassified samples (test set) and print out their class labels. 5 points

In [167]:
print("Predicted Probabilities:", myLR_2.predict_proba(X_test))
print("Class labels (column order of the probabilities):", myLR_2.classes_)

Predicted Probabilities: [[2.73867407e-01 7.26132593e-01]
 [8.13192224e-01 1.86807776e-01]
 [8.34775552e-01 1.65224448e-01]
 [5.89012144e-01 4.10987856e-01]
 [9.07320396e-01 9.26796039e-02]
 [7.64689377e-01 2.35310623e-01]
 [8.95612289e-01 1.04387711e-01]
 [8.33574737e-01 1.66425263e-01]
 [7.07165535e-01 2.92834465e-01]
 [8.40084890e-01 1.59915110e-01]
 [6.91168002e-01 3.08831998e-01]
 [7.21649302e-01 2.78350698e-01]
 [8.50538882e-01 1.49461118e-01]
 [7.40201208e-01 2.59798792e-01]
 [6.81537081e-01 3.18462919e-01]
 [1.22685011e-01 8.77314989e-01]
 [7.93319056e-01 2.06680944e-01]
 [2.12138899e-02 9.78786110e-01]
 [8.30922498e-01 1.69077502e-01]
 [3.65631951e-03 9.96343680e-01]
 [3.57413943e-02 9.64258606e-01]
 [5.69413939e-01 4.30586061e-01]
 [7.65963543e-01 2.34036457e-01]
 [7.08272737e-01 2.91727263e-01]
 [8.56672866e-01 1.43327134e-01]
 [6.08621215e-01 3.91378785e-01]
 [6.95230836e-01 3.04769164e-01]
 [8.18809845e-01 1.81190155e-01]
 [7.69241537e-01 2.30758463e-01]
 [6.96834694e-04 9
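
The output above lists probabilities for every test sample. To restrict the printout to the misclassified samples, as the task asks, one option is a boolean mask, sketched below. The helper `report_misclassified` is a hypothetical name, and the demo runs on synthetic data from `make_classification` (not the breast cancer set) so that it is self-contained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def report_misclassified(model, X, y_true):
    """Return (probabilities, true labels, predicted labels) for the samples
    that model.predict gets wrong."""
    y_hat = model.predict(X)
    wrong = y_hat != y_true                  # boolean mask of errors
    return model.predict_proba(X)[wrong], y_true[wrong], y_hat[wrong]

# Synthetic stand-in data, with some label noise so that errors occur.
X_syn, y_syn = make_classification(n_samples=200, n_features=5,
                                   flip_y=0.1, random_state=0)
model_syn = LogisticRegression().fit(X_syn, y_syn)

proba_w, true_w, pred_w = report_misclassified(model_syn, X_syn, y_syn)
for p, t, k in zip(proba_w, true_w, pred_w):
    print(f"P(class 1)={p[1]:.3f}  true={t}  predicted={k}")
```

With the names used in this notebook, the call would be `report_misclassified(myLR_2, X_test, y_test)`.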

2) Define a new prediction function that uses a threshold $th=0.37$ to determine the final class label (0 or 1). 10 points

Tips: if the predict_proba function returns a probability that is less than $th$, the new prediction function returns 0; otherwise it returns 1.

In [168]:
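def myPredict(lr, X, th = 0.35):
    '''  
        input:
            lr: a trained logistic regression model
            X: input feature vectors
            th: threshold on the predicted probability of class 1
                (default 0.35; pass th=0.37 to use the value given in the task)

        return predicted binary labels for all input samples
    '''    
    #add your code here
    # column 1 of predict_proba is P(y = 1 | x), because lr.classes_ is [0, 1]
    p_malignant = lr.predict_proba(X)[:, 1]
    # label 1 when the probability reaches the threshold, otherwise 0
    pred_b = (p_malignant >= th).astype(int)

    #
    return pred_b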

y_pred = myPredict(myLR_2, X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)

#print out the accuracy and the recall ratio on the test set
# the parentheses around the numerator matter: without them only cm[1][1] is divided
accuracy = (cm[0][0]+cm[1][1])/(cm[0][0]+cm[0][1]+cm[1][0]+cm[1][1])
recall = cm[1][1]/(cm[1][1]+cm[1][0])
print(f"Accuracy: {accuracy:.4f}")
print(f"Recall: {recall:.4f}")

[[61  6]
 [ 7 40]]
Accuracy: 0.8860
Recall: 0.8511
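
Lowering the threshold trades precision for recall. The sketch below evaluates a few thresholds on made-up probabilities and labels to show the direction of that trade-off; `recall_precision_at`, `p_demo`, and `y_demo` are hypothetical names, and the numbers are not from the breast cancer model.

```python
import numpy as np

def recall_precision_at(p_pos, y_true, th):
    """Recall and precision when samples with p_pos >= th are labeled 1."""
    y_hat = (p_pos >= th).astype(int)
    tp = np.sum((y_hat == 1) & (y_true == 1))
    fn = np.sum((y_hat == 0) & (y_true == 1))
    fp = np.sum((y_hat == 1) & (y_true == 0))
    recall = tp / (tp + fn)
    precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    return recall, precision

# Made-up probabilities of class 1 and true labels, for illustration only.
p_demo = np.array([0.05, 0.20, 0.30, 0.36, 0.45, 0.55, 0.70, 0.90])
y_demo = np.array([0, 0, 1, 0, 1, 0, 1, 1])

for th in (0.25, 0.37, 0.50):
    r, p = recall_precision_at(p_demo, y_demo, th)
    print(f"th={th:.2f}  recall={r:.2f}  precision={p:.2f}")
```

With the names used in this notebook, the same helper could be called as `recall_precision_at(myLR_2.predict_proba(X_test)[:, 1], y_test, 0.37)`.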


3) Other possible ideas could be applied to improve the recall ratio of the logistic regression. 5 points

Response:

...