In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Your own logistic regression model

This time it's your turn to create a logistic regression model. The [dataset](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)) that we're going to work with contains diagnostic information about breast cancer. Medical features of individual tumors, such as size, shape and smoothness, were measured. The tumors are labeled as **malignant** (0) or **benign** (1).

In [None]:
data = load_breast_cancer(as_frame=True)
features = data["data"]
feature_names = data["feature_names"]
labels = data["target"]
label_names = data["target_names"]

print(features.columns)
print(labels[15:20])
print(label_names)

## Data exploration

Let's have a closer look at the distribution of the data. How many tumors are there in the dataset overall? Also find out how many tumors were classified as malignant and how many as benign. Is this a well-balanced dataset, or is one class overrepresented?

Assign the number of malignant tumors to a variable called `n_malignant` and the number of benign tumors to a variable called `n_benign` to pass the tests in the following test cell.
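
One possible way to count classes in a label column is pandas' `value_counts()`. The following is only a minimal sketch on a made-up label Series (`toy_labels` and the other names are hypothetical, and the values are not the tumor data), using the same encoding of 0 for malignant and 1 for benign:

```python
import pandas as pd

# Hypothetical toy labels, NOT the breast cancer data: 0 = malignant, 1 = benign
toy_labels = pd.Series([0, 1, 1, 0, 1, 1, 1], name="target")

# value_counts() returns the number of rows per distinct label
counts = toy_labels.value_counts()
n_toy_malignant = counts.loc[0]
n_toy_benign = counts.loc[1]

# normalize=True gives class shares instead of counts, handy for judging balance
shares = toy_labels.value_counts(normalize=True)

print(len(toy_labels), n_toy_malignant, n_toy_benign)
print(shares)
```

The same calls should work on any pandas Series of labels.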

In [None]:


n_malignant = None
n_benign = None

In [None]:
# Test cell

assert n_malignant == 53*4, f"n_malignant should be {53*4}"
assert n_benign == 119*3, f"n_benign should be {119*3}"


Next we should look at the features and decide whether we have to preprocess them or can use them as they are. What are the data types of the features? If they are non-numerical, we need to convert them to numbers first. Are there any missing values or NaNs (not-a-number)?

Assign `True` or `False` to the variables `need_to_convert_non_numerical` and `need_to_convert_nan`, depending on whether this dataset requires the respective conversion, to pass the test cell.
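
Checks like these can be sketched with `dtypes`, `select_dtypes` and `isna`. The example below uses a made-up DataFrame (`toy` and its columns are hypothetical, not the tumor data) that deliberately contains one text column and one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, NOT the breast cancer data
toy = pd.DataFrame({
    "size": [1.2, 3.4, np.nan],
    "kind": ["round", "oval", "round"],
})

# dtypes shows the data type of every column;
# text columns typically show up as "object" or a string dtype
print(toy.dtypes)

# Columns that are not numeric would need encoding before modelling
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

# isna().sum() counts the missing values per column
nan_counts = toy.isna().sum()

print(non_numeric_cols)
print(nan_counts)
```

An empty list of non-numeric columns and all-zero NaN counts would suggest that no conversion is needed.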

In [None]:


need_to_convert_non_numerical = None
need_to_convert_nan = None

In [None]:
# Test cell

assert (not need_to_convert_non_numerical) and (need_to_convert_non_numerical is not None), "Which column contains non_numerical values? How do you know?"
assert (not need_to_convert_nan) and (need_to_convert_nan is not None), "Which column contains NAN values? How do you know?"

Have a look at a specific column. What are the mean and the standard deviation of the "mean radius" of the tumors?
Can you test whether "mean radius", "mean perimeter" and "mean area" are correlated?

Hint: You can access a sub-dataframe of a pandas-dataframe by giving it a list of columns as its index:

```my_subframe = my_dataframe[["column1","column2","column3"]]```

Assign the mean and the standard deviation of the "mean radius" column to variables called `mean_radius_mean` and `mean_radius_std` to pass the test cell. Also assign a variable `geometry_properties_correlated` to `True` or `False`, depending on whether you found the "mean radius", "mean perimeter" and "mean area" columns to be strongly correlated.
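
As a rough illustration of column statistics and correlations, here is a sketch on made-up circles (the frame `toy` and its column names are hypothetical, not the tumor data). Note that pandas and NumPy use different defaults for the standard deviation:

```python
import numpy as np
import pandas as pd

# Hypothetical toy circles, NOT the breast cancer data
radius = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
toy = pd.DataFrame({
    "radius": radius,
    "perimeter": 2 * np.pi * radius,
    "area": np.pi * radius ** 2,
})

col_mean = toy["radius"].mean()
col_std = toy["radius"].std()                # pandas default: sample std (ddof=1)
col_std_np = toy["radius"].to_numpy().std()  # NumPy default: population std (ddof=0)

# Pairwise Pearson correlation coefficients of the selected columns
corr = toy[["radius", "perimeter", "area"]].corr()

print(col_mean, col_std, col_std_np)
print(corr)
```

Coefficients close to 1 or -1 indicate a strong linear relationship, while values near 0 indicate none.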

In [None]:
mean_radius_mean = None
mean_radius_std = None


geometry_properties_correlated = None

In [None]:
# Test cell

assert np.isclose(mean_radius_mean*mean_radius_std, 49.785265873530975, rtol=1e-03, atol=1e-04), "One or both of the mean and the std of mean_radius seem to be off. Did you pick the right column?"

assert geometry_properties_correlated, "Why wouldn't radius, perimeter and area be correlated?"


## Data preparation

To validate your model in the end, you will need a separate test set. Therefore, you should now split your data into two random subsets for training and testing. Your test set should contain 15% of your total dataset. Also make sure that both subsets have the expected sample size.

Assign the features and labels of your subsets to the variables `X_train, X_test, y_train, y_test`. To pass the test cell, the sample size and the number of features have to be correct.
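
A random split can be sketched with scikit-learn's `train_test_split`. The data below is random toy data (`X_toy`, `y_toy` and the chosen `random_state` are assumptions for illustration, not the tumor data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data, NOT the breast cancer data: 100 samples, 3 features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = rng.integers(0, 2, size=100)

# test_size is the held-out fraction; random_state makes the shuffle reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.15, random_state=42
)

print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```

When the fraction does not yield a whole number of samples, scikit-learn rounds the test set size up, so checking the resulting shapes is a good habit.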

In [None]:
X_train, X_test, y_train, y_test = None, None, None, None

In [None]:
# Test cell

assert y_train.shape == (483,), "Your training set does not contain 85% of the data"
assert y_test.shape == (86,), "Your test set does not contain 15% of the data"

assert X_train.shape[0] == 483, "Your training set does not contain 85% of the data"
assert X_test.shape[0] == 86, "Your test set does not contain 15% of the data"
assert len(X_train.shape) == 2, "Your training set is not a 2D-Matrix"
assert len(X_test.shape) == 2, "Your test set is not a 2D-Matrix"
if (len(X_train.shape) == 2) and (len(X_test.shape) == 2):
    assert X_train.shape == (483, 30), "Your training set does not contain all 30 features of the dataset"
    assert X_test.shape == (86, 30), "Your test set does not contain all 30 features of the dataset"

Standardize the training data and the test data with the mean and the standard deviation of the training data.

Assign the scaled values to variables called `X_train_scaled` and `X_test_scaled`. To pass the test cell, the mean and standard deviation of `X_train_scaled` should be very close to 0 and 1, while the mean and standard deviation of `X_test_scaled` should deviate slightly from 0 and 1.
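
The idea of fitting the scaler on the training data only and reusing it for the test data might look like this. It is a sketch on random toy arrays (`train_toy` and `test_toy` are hypothetical, not the tumor data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data, NOT the breast cancer data
rng = np.random.default_rng(0)
train_toy = rng.normal(loc=10.0, scale=3.0, size=(80, 2))
test_toy = rng.normal(loc=10.0, scale=3.0, size=(20, 2))

scaler = StandardScaler()
train_toy_scaled = scaler.fit_transform(train_toy)  # learns mean/std from the training data
test_toy_scaled = scaler.transform(test_toy)        # reuses the training mean/std

print(train_toy_scaled.mean(axis=0), train_toy_scaled.std(axis=0))
print(test_toy_scaled.mean(axis=0), test_toy_scaled.std(axis=0))
```

Because the test data never influences the scaler, its scaled mean and standard deviation end up near, but not exactly at, 0 and 1.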

In [None]:


X_train_scaled = None
X_test_scaled = None

In [None]:
# Test cell

assert np.allclose(X_train_scaled.mean(axis=0), np.zeros(shape=(30,))), "The mean of all columns of X_train_scaled is not close to 0"
assert np.allclose(X_train_scaled.std(axis=0), np.ones(shape=(30,))), "The standard deviation of all columns of X_train_scaled is not close to 1"

assert not np.allclose(X_test_scaled.mean(axis=0), np.zeros(shape=(30,))), "The mean of all columns of X_test_scaled is suspiciously close to 0"
assert not np.allclose(X_test_scaled.std(axis=0), np.ones(shape=(30,))), "The standard deviation of all columns of X_test_scaled is suspiciously close to 1"


## Model training

Now it's finally time to create your `LogisticRegression` model and fit it to the data in your training set.

Evaluate your model with a metric of your choice.
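
Fitting and scoring a classifier could look roughly like the sketch below. It uses synthetic data from `make_classification` (all names such as `toy_model` are hypothetical, not the tumor data), and accuracy and recall are just two of many possible metrics:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data, NOT the breast cancer data
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

# Scale with statistics from the training data only
scaler = StandardScaler().fit(X_tr)

toy_model = LogisticRegression()
toy_model.fit(scaler.transform(X_tr), y_tr)

y_pred = toy_model.predict(scaler.transform(X_te))

toy_accuracy = accuracy_score(y_te, y_pred)
# Recall for class 0: share of true class-0 samples that were predicted as class 0
toy_recall_class0 = recall_score(y_te, y_pred, pos_label=0)

print(toy_accuracy, toy_recall_class0)
```

Which metric is most informative depends on which kind of error is more costly in your application.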

Finally, visualize the results of your prediction in a confusion matrix. Don't forget that the labels you want to display are the list ["malignant", "benign"]. This list is still saved in the variable `label_names`.
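
One way to draw a confusion matrix is `ConfusionMatrixDisplay`. The sketch below uses made-up labels and predictions (`toy_true`, `toy_pred` and `toy_label_names` are hypothetical, not model output):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Hypothetical toy labels and predictions: 0 = malignant, 1 = benign
toy_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
toy_pred = np.array([0, 0, 1, 1, 1, 1, 0, 1])
toy_label_names = ["malignant", "benign"]

# Rows are true classes, columns are predicted classes, both in the order [0, 1]
cm = confusion_matrix(toy_true, toy_pred, labels=[0, 1])
print(cm)

# display_labels replaces the numeric class ids on the axes with readable names
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=toy_label_names)
disp.plot()
```

The order of `display_labels` has to match the order of the classes in the matrix, here 0 first and 1 second.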

Right now we classify each tumor as whichever class has the higher probability, i.e. the decision threshold is 0.5, which makes sense in most cases. In other cases we can improve the model by adjusting the decision threshold, e.g. a tumor must have a probability of more than 90% of being benign to be classified as benign; otherwise we classify it as malignant. This might make sense even if it lowers the accuracy of the model, especially in this example. Why?
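
To see how a threshold shifts the kinds of errors, consider this small sketch with made-up probabilities (`toy_true` and `toy_proba_benign` are hypothetical values, not model output):

```python
import numpy as np

# Hypothetical toy values: true labels (0 = malignant, 1 = benign) and
# made-up predicted probabilities for the benign class
toy_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
toy_proba_benign = np.array([0.05, 0.40, 0.70, 0.60, 0.80, 0.95, 0.97, 0.99])

results = {}
for threshold in (0.5, 0.9):
    # Predict benign (1) only if its probability exceeds the threshold
    toy_pred = (toy_proba_benign > threshold).astype(int)
    accuracy = float((toy_pred == toy_true).mean())
    # Malignant tumors that were wrongly predicted as benign
    missed_malignant = int(((toy_true == 0) & (toy_pred == 1)).sum())
    # Benign tumors that were wrongly predicted as malignant
    false_alarms = int(((toy_true == 1) & (toy_pred == 0)).sum())
    results[threshold] = (accuracy, missed_malignant, false_alarms)
    print(threshold, accuracy, missed_malignant, false_alarms)
```

Compare the two kinds of errors at both thresholds and think about which one you would rather avoid in a medical setting.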

The next code cell offers a way to play around with different decision thresholds. Call the `test_different_thresholds` function with your variables and a threshold list of your choice (```my_function(my_argument1, my_argument2, ...)```) to see how the threshold affects the prediction outcome.

In [None]:
threshold_list = [0.1, 0.3, 0.5, 0.7, 0.9]

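def test_different_thresholds(model, X_test_scaled, y_test, threshold_list):
    """Print accuracy and confusion matrix of `model` for each decision threshold."""
    # Column 1 of predict_proba holds the probability of class 1 (benign)
    pred_proba = model.predict_proba(X_test_scaled)

    for threshold in threshold_list:
        print(f"\n******** For threshold = {threshold:.1f} ********")
        # Predict benign (1) only if its probability exceeds the threshold
        y_test_pred = (pred_proba[:, 1] > threshold).astype(int)
        test_accuracy = metrics.accuracy_score(y_test, y_test_pred)
        print(f"Our testing accuracy is {test_accuracy:.2f}")

        print(confusion_matrix(y_test, y_test_pred))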
        


**Congratulations on building your own logistic regression model!**