## SP23:CS 477/577: Python for Machine Learning

### HW2: Use logistic regression for classification


Student Name: Last, First

### 1. Preliminary

#### 1.1 Problem formulation
1. Training set. Data samples: X_train = {$x_1, x_2, ..., x_i, ..., x_m$}. Target (class labels): Y_train = {$y_1, y_2, ..., y_i, ..., y_m$}
    
2. Predicted class probability in logistic regression:

\begin{equation*} 
    \hat{y_i} = \frac{1}{1 + \exp\left(-(w_0 + w_1 x_{i,1} + \cdots + w_n x_{i,n})\right)}
\end{equation*}    
    where $w_0, w_1, \ldots, w_n$ are the linear model parameters, and $x_{i,1}, \ldots, x_{i,n}$ are the features of the $i$th feature vector $x_i$.

3. Cost/loss function of the logistic regression: the criterion to quantitatively evaluate how good the current model is (the lower the better).
\begin{equation*}
    J(w) = -C \sum_{i=1}^m \left[ y_i \log(\hat{y_i}) + (1-y_i) \log(1-\hat{y_i}) \right] + P(w)
\end{equation*}
    - In the above equation, $\hat{y_i}$ and $y_i$ are the predicted class probability and the true class label of data sample $x_i$, respectively.
    - P is the penalty/regularizer, and C controls the weight of the data term relative to P (a smaller C means stronger regularization). There are three typical options of P: L1 norm ($\|w\|_1$), L2 norm ($\frac{1}{2}w^T w$) and Elastic-Net ($\frac{1 - \rho}{2}w^T w + \rho \|w\|_1$)
    
4. Model training uses an 'optimization algorithm' to find the $w$ that minimizes the loss function $J(w)$. Refer to the gradient descent algorithm in the Other Learning materials for more information.
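
The formulation above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not how scikit-learn trains its models: the function names, the toy data, the choice of the L2 penalty (applied to $w_1, ..., w_n$ but not to $w_0$), the step size 0.1, and the 200 iterations are all made up for the example.

```python
import numpy as np


def predict_proba(w, X):
    """Sigmoid of the linear score; w[0] plays the role of w_0."""
    z = w[0] + X @ w[1:]
    return 1.0 / (1.0 + np.exp(-z))


def loss(w, X, y, C=1.0):
    """J(w) = -C * sum of log-likelihood terms + L2 penalty (assumed choice of P)."""
    p = predict_proba(w, X)
    log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    penalty = 0.5 * w[1:] @ w[1:]  # intercept left unpenalized (assumption)
    return -C * log_lik + penalty


def gradient(w, X, y, C=1.0):
    """Gradient of the loss above with respect to w."""
    err = predict_proba(w, X) - y
    grad = np.empty_like(w)
    grad[0] = C * err.sum()
    grad[1:] = C * (X.T @ err) + w[1:]
    return grad


# Toy data: one feature, four samples (hypothetical values).
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0.0, 0.0, 1.0, 1.0])

w = np.zeros(2)
initial_loss = loss(w, X_toy, y_toy)

# Plain gradient descent with a fixed step size.
for _ in range(200):
    w -= 0.1 * gradient(w, X_toy, y_toy)

final_loss = loss(w, X_toy, y_toy)
print("loss before:", initial_loss, "loss after:", final_loss)
```

With all weights at zero every predicted probability is 0.5, so the starting loss is $m \log 2$; each descent step then moves $w$ toward lower loss.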

#### 1.2. The sklearn.linear_model.LogisticRegression class

1. Class introduction: key attributes (e.g., C and penalty) and methods (fit, predict, predict_proba, and score)
    - https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#
    
2. source code
    - https://github.com/scikit-learn/scikit-learn/blob/b194674c4/sklearn/linear_model/_logistic.py#L1191

### 2. Breast Cancer Detection

Develop a logistic regression approach to classify breast tumor cases as malignant (cancer) or benign.

Dataset

    :Number of data samples: 569
    :Number of features: 30 numeric. The first 10 features were directly calculated using the mean features of all nuclei in an image
    :Class labels (We changed the original class labels by using 1 - y.)
        : Malignant: 1 (malignant or cancer)
        : Benign: 0  (benign or non-cancer)  
    https://scikit-learn.org/stable/datasets/index.html#breast-cancer-dataset
    
    

#### 2.1 Prepare dataset

In [3]:
import sklearn.datasets as ds
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np
import sklearn
print(sklearn.__version__)

# 1) data preparation
X, y = ds.load_breast_cancer(return_X_y=True)
y = 1-y
print(X.shape, y.shape)

rs = 0
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=rs)
print('training samples', X_train.shape, 'test samples', X_test.shape)


1.2.1
(569, 30) (569,)
training samples (455, 30) test samples (114, 30)


#### 2.2 Train and explore a logistic regression model. 25 points

In [4]:
# 1) train a logistic regression model. 5 points
model = LogisticRegression(
    random_state=rs, max_iter=3000).fit(X=X_train, y=y_train)


# 2) print out the optimal model(linear) parameters after the training. 5 points
# w0
# w1...n
params = model.coef_
print(params)

# 3) calculate and print out the predicted class probabilities of
# the first 5 training samples by implementing the equation in 1.1 (y_i^hat). 10 points


# 4) calculate and print out the predicted class probabilities of
# the first 5 training samples using the predict_proba() function. 5 points
probability = model.predict_proba(X_train[:5])
print(probability)


[[-0.70193886 -0.18897771  0.18942047 -0.02287614  0.14411881  0.19150739
   0.37130293  0.20725414  0.26954863  0.03052996  0.04613319 -1.02791655
   0.03333033  0.10636898  0.01239678 -0.04904355  0.02211939  0.02255294
   0.03059153 -0.01346839 -0.26372385  0.41354888  0.19954951  0.0110381
   0.25778885  0.68276195  1.22299877  0.44846203  0.62294273  0.0968387 ]]
[[9.99605374e-01 3.94626224e-04]
 [9.66264322e-01 3.37356777e-02]
 [9.44933289e-01 5.50667111e-02]
 [9.99795932e-01 2.04068351e-04]
 [8.90069867e-01 1.09930133e-01]]
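
Step 3) above asks for the same probabilities computed directly from the equation in 1.1. A minimal sketch of that calculation follows; the helper name `manual_predict_proba` and the toy weights are hypothetical, chosen only so the block runs without the fitted model.

```python
import numpy as np


def manual_predict_proba(intercept, coef, X):
    """Return P(y = 1 | x) for each row of X using the sigmoid equation from 1.1."""
    X = np.asarray(X, dtype=float)
    z = intercept + X @ np.asarray(coef, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))


# Toy parameters and inputs (hypothetical values, not the trained model).
w0 = -1.0
w = np.array([2.0, -0.5])
X_demo = np.array([[0.0, 0.0],
                   [1.0, 2.0],
                   [3.0, 1.0]])

p1 = manual_predict_proba(w0, w, X_demo)

# Same layout as predict_proba: column 0 is P(class 0), column 1 is P(class 1).
print(np.column_stack([1 - p1, p1]))
```

For the model trained in this cell, a call such as `manual_predict_proba(model.intercept_[0], model.coef_[0], X_train[:5])` would be expected to reproduce the second column of the `predict_proba` output.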


#### 2.3 Evaluation 1.0. 25 points

1) Accuracy. 5 points

In [5]:
# calculate and print out the accuracy of the trained model on the training set
# using the score function
print(model.score(X_train, y_train))


# calculate and print out the accuracy of the trained model on the test set
# using the score function
print(model.score(X_test, y_test))


0.9626373626373627
0.9473684210526315


2) Confusion matrix. 5 points

In [10]:
from sklearn.metrics import confusion_matrix
# print out the confusion matrix (cfm) for the trained model on the test set
pred = model.predict(X_test)
cfm = confusion_matrix(y_true=y_test, y_pred=pred)
tn, fp, fn, tp = cfm.ravel()
print(cfm)
#


[[62  5]
 [ 1 46]]
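
As a reading aid, the sketch below unpacks the printed matrix by hand. It assumes the scikit-learn layout (rows are true labels, columns are predicted labels, class 0 first) and hard-codes the values shown above.

```python
# Confusion matrix printed above: rows = true class, columns = predicted class.
cfm = [[62, 5],
       [1, 46]]

(tn, fp), (fn, tp) = cfm
total = tn + fp + fn + tp
accuracy = (tn + tp) / total

print(f"benign cases flagged as malignant (false positives): {fp}")
print(f"malignant cases missed (false negatives): {fn}")
print(f"accuracy from the matrix: {accuracy:.4f}")
```

The accuracy derived this way agrees with the test-set value returned by `score` in the previous cell.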


3) What is your observation from the cfm? 5 points

Response (add your response here): 

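   1) The model predicts malignant for benign cases (5 false positives) more often than it predicts benign for malignant cases (1 false negative).
   
   2) It has a high overall accuracy: 108 of the 114 test samples (about 94.7%) are classified correctly.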
   
   ...

4) Recall and Precision. 10 points

In [12]:
# print out the recall ratio of the trained model for
# detecting breast cancer using the above cfm
Recall = (tp)/(tp+fn)
print(Recall)

# print out the precision of the trained model for
# detecting breast cancer using the above cfm
Precision = (tp)/(tp+fp)
print(Precision)


0.9787234042553191
0.9019607843137255


####  2.4 Evaluation 2.0. 30 points

K-fold cross validation. (See Lecture Notes 10)

1) Why do we need K-fold cross validation in this application? 5 points

    Response (add your response here):
    
    K-fold cross validation shows how sensitive the model's performance is to the particular choice of training and test data, instead of relying on a single split.
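
To make the idea concrete, here is a rough sketch of how K-fold partitions sample indices. It is a simplification: the generator name `kfold_indices` is hypothetical, the folds are contiguous and unshuffled, and scikit-learn's `cross_validate` uses stratified folds for classifiers when `cv` is an integer.

```python
import numpy as np


def kfold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each fold serves as the test set once."""
    folds = np.array_split(np.arange(n_samples), k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx


# 569 samples and 5 folds, matching the dataset size and k used in this notebook.
splits = list(kfold_indices(569, 5))
fold_sizes = [len(test_idx) for _, test_idx in splits]
print(fold_sizes)
```

Every sample lands in exactly one test fold, so each of the 569 cases is used for testing once and for training four times.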

2) Use the 5-fold cross validation to re-evaluate a logistic regression model. 15 points

In [21]:
# cross_validate in sklearn should be used in this task
from sklearn.model_selection import cross_validate as cv
from sklearn.model_selection import cross_val_score

myLR_1 = LogisticRegression(
    C=5, penalty='l1', solver='saga', random_state=rs, max_iter=4000)

# call the cross_validate function and set scoring = ('accuracy', 'recall')
k = 5
crossVal = cv(myLR_1, X, y, cv=k, scoring=('accuracy', 'recall'))

# print out the test accuracy for each fold and print out the mean accuracy
print("Accuracy", crossVal["test_accuracy"])

# print out the test recall ratios for each fold and print out the mean recall ratio
print("recall", crossVal["test_recall"])


Accuracy [0.90350877 0.92982456 0.90350877 0.95614035 0.92035398]
recall [0.76744186 0.88372093 0.83333333 0.9047619  0.88095238]
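
The comments in the cell also ask for the mean of each metric. A small sketch of that summary, using the per-fold values copied from the output above rather than re-running the cross validation:

```python
import numpy as np

# Per-fold scores copied from the printed output above.
test_accuracy = np.array([0.90350877, 0.92982456, 0.90350877, 0.95614035, 0.92035398])
test_recall = np.array([0.76744186, 0.88372093, 0.83333333, 0.9047619, 0.88095238])

mean_accuracy = test_accuracy.mean()
mean_recall = test_recall.mean()

print(f"mean accuracy: {mean_accuracy:.4f} "
      f"(min {test_accuracy.min():.4f}, max {test_accuracy.max():.4f})")
print(f"mean recall:   {mean_recall:.4f} "
      f"(min {test_recall.min():.4f}, max {test_recall.max():.4f})")
```

In the notebook itself the same numbers would come from `crossVal["test_accuracy"].mean()` and `crossVal["test_recall"].mean()`.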


3) What are the differences between Evaluation 1.0 and Evaluation 2.0? Which one is better for this application, and why? 10 points

Response (Add your responses here):

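(1) Evaluation 1.0 scores the model on a single train/test split, while Evaluation 2.0 trains and tests on five different combinations of training and test data, which shows how the model performs in a best-case split.

(2) Evaluation 2.0 also shows how the model performs in a worst-case split, so it gives a more reliable picture of the model than one split alone.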

...




#### 2.5 How can we improve the recall ratio?  20 points.

Achieving a high recall ratio (RR) is very important in cancer detection because RR indicates a model's ability to identify cancers, and a low recall ratio may cost the lives of patients. Can you help improve the logistic regression approach to increase the recall ratio?

In [29]:
# train a logistic regression model.

# Changed solver to 'newton-cholesky'
# changed penalty to 'l2'
# improves both accuracy and recall rate
myLR_2 = LogisticRegression(
    C=5, penalty='l2', solver='newton-cholesky', random_state=rs, max_iter=4000)
myLR_2.fit(X=X_train, y=y_train)

# call the cross_validate function and set scoring = ('accuracy', 'recall')
k = 5
crossVal = cv(myLR_2, X, y, cv=k, scoring=('accuracy', 'recall'))

# print out the test accuracy for each fold and print out the mean accuracy
print("Accuracy", crossVal["test_accuracy"])

# print out the test recall ratios for each fold and print out the mean recall ratio
print("recall", crossVal["test_recall"])


Accuracy [0.93859649 0.93859649 0.97368421 0.94736842 0.96460177]
recall [0.86046512 0.88372093 0.95238095 0.92857143 0.97619048]


1) Print out the predicted probabilities of all misclassified samples (test set) and print out their class labels. 5 points

In [34]:
preds = myLR_2.predict(X_test)
prob_preds = myLR_2.predict_proba(X_test)

for i in range(preds.size):
    if preds[i] != y_test[i]:
        print("Incorrect guess:", preds[i])
        print(prob_preds[i])
    

Incorrect guess: 1
[0.48036501 0.51963499]
Incorrect guess: 1
[0.37002772 0.62997228]
Incorrect guess: 1
[0.13756109 0.86243891]
Incorrect guess: 1
[0.40725988 0.59274012]
Incorrect guess: 0
[0.75600375 0.24399625]
Incorrect guess: 1
[0.2075175 0.7924825]


2) Define a new prediction function that uses threshold $th=0.37$ to determine the final class label (0 or 1). 10 points

Tips: if the predict_proba function returns a probability that is less than $th$, the new prediction function returns 0; otherwise it returns 1.

In [54]:
def myPredict(lr, X, th=0.35):
    '''  
        input:
            lr: a trained logistic regression model
            X: input feature vectors
            th: threshold, compared against the class-0 probability

        return predicted binary labels for all input samples
    '''
    probs = lr.predict_proba(X)
    pred_b = np.empty(probs.shape[0])

    # column 0 of predict_proba is P(class 0): predict 0 when it is
    # at least th, otherwise predict 1
    for i, pred in enumerate(probs):
        if pred[0] >= th:
            pred_b[i] = 0
        else:
            pred_b[i] = 1

    return pred_b


y_pred = myPredict(myLR_2, X_test, th=0.35)
print(y_pred.size)
print(y_test.size)
cm = confusion_matrix(y_true= y_test, y_pred= y_pred)
print(cm)

# print out the accuracy and the recall ratio on the test set


114
114
[[65  2]
 [ 1 46]]
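
For comparison, here is a sketch of the rule exactly as the tip words it, applied to the class-1 (malignant) probability. The function names are hypothetical. The six probabilities are the second column of the misclassified samples printed in 1), and their true labels are inferred as the opposite of each incorrect guess. The last lines compute the accuracy and recall that the cell above asks for, from the printed confusion matrix.

```python
import numpy as np


def predict_with_threshold(proba_pos, th=0.5):
    """Return 1 where P(class 1) >= th, otherwise 0."""
    return (np.asarray(proba_pos) >= th).astype(int)


def recall_ratio(y_true, y_pred):
    """Fraction of true class-1 samples that were predicted as class 1."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return tp / (tp + fn)


# P(class 1) of the six misclassified test samples printed in 1).
proba_pos = np.array([0.51963499, 0.62997228, 0.86243891,
                      0.59274012, 0.24399625, 0.7924825])
# True labels inferred from the incorrect guesses (1, 1, 1, 1, 0, 1).
y_true = np.array([0, 0, 0, 0, 1, 0])

recalls = {}
for th in (0.5, 0.35, 0.2):
    y_pred = predict_with_threshold(proba_pos, th)
    recalls[th] = recall_ratio(y_true, y_pred)
    print(f"th={th}: predictions={y_pred}, recall on this subset={recalls[th]}")

# Accuracy and recall from the confusion matrix printed above.
cm = np.array([[65, 2],
               [1, 46]])
tn, fp, fn, tp = cm.ravel()
cm_accuracy = (tn + tp) / cm.sum()
cm_recall = tp / (tp + fn)
print("accuracy:", cm_accuracy, "recall:", cm_recall)
```

On this small subset, the one missed malignant case has a class-1 probability of about 0.244, so a threshold on that probability would have to drop below that value to recover it.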


3) Other possible ideas that could be applied to improve the recall ratio of the logistic regression. 5 points

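Response:
    We could try a different solver,
    tune the penalty type and its strength C,
    or move the decision threshold further so that more borderline cases are labeled malignant, trading some precision for higher recall.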

...