# Answer the following questions

In [None]:
from sklearn.datasets import load_breast_cancer
import numpy as np
x,y = load_breast_cancer(return_X_y = True, as_frame = True)
x.head()


### How was the data obtained?

##### Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

#### The separating plane was obtained using the Multisurface Method-Tree (MSM-T) [K. P. Bennett, “Decision Tree Construction Via Linear Programming.” Proceedings of the 4th Midwest Artificial Intelligence and Cognitive Science Society, pp. 97-101, 1992], a classification method which uses linear programming to construct a decision tree. Relevant features were selected using an exhaustive search in the space of 1-4 features and 1-3 separating planes.

### How many classes are there?

In [None]:
y.unique()

### What does each row represent?

#### Each row represents one sample: a digitized image of a fine needle aspirate (FNA) of a breast mass. The values in the row describe characteristics of the cell nuclei present in that image.

### How many data points are there?

In [None]:
x.shape[0]

### How many features?

In [None]:
x.shape[1]

### Which kind of features are there?

#### The mean, standard error, and “worst” or largest (mean of the three worst/largest values) of these features were computed for each image, resulting in 30 features. For instance, field 0 is Mean Radius, field 10 is Radius SE, field 20 is Worst Radius.
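
The grouping into mean, standard error, and worst values can also be read from the column names. The following is a minimal sketch that assumes the naming used by the scikit-learn copy of the dataset (`mean ...`, `... error`, `worst ...`); the `groups` dictionary is an illustrative name, not part of the exercise.

```python
from sklearn.datasets import load_breast_cancer

x, _ = load_breast_cancer(return_X_y=True, as_frame=True)

# Group the 30 columns by the statistic encoded in their names.
groups = {
    "mean": [c for c in x.columns if c.startswith("mean ")],
    "error": [c for c in x.columns if c.endswith(" error")],
    "worst": [c for c in x.columns if c.startswith("worst ")],
}

for name, cols in groups.items():
    print(name, len(cols), cols[:3])
```

Each group should contain the same ten underlying measurements, such as radius, texture, and perimeter.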

### Which feature(s) have/has the highest absolute values?

In [None]:
# Rank the features by their mean value and show the five largest
x.describe().loc["mean"].sort_values(ascending=False)[:5]


##### By mean value, the largest features are worst area, mean area, and worst perimeter.

### Just from the information present, do you expect high or low correlation?

In [None]:
x.corr()

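#### High. Many features are derived from the same underlying measurements (for example radius, perimeter, and area, each as mean, standard error, and worst value), so strong correlations are expected.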
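
The full correlation matrix is hard to read at 30 x 30. As a rough sketch, the most strongly correlated feature pairs can be listed directly; the names `upper` and `top_pairs` are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

x, _ = load_breast_cancer(return_X_y=True, as_frame=True)
corr = x.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
top_pairs = upper.stack().dropna().sort_values(ascending=False)

print(top_pairs.head(5))
```

Geometrically related features such as radius and perimeter should appear near the top of this list.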

# Classification

#### 1. Split data into a train and test set using a test set size of 15%.
#### 2. Scale the data.
#### 3. Train a KNeighborsClassifier and report the accuracy score on the test set.
#### Use nearest neighbors.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split


In [None]:
# Hold out 15% of the data as the test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.15)

# Fit the scaler on the training data only, then reuse it for the test data
scaler = StandardScaler(copy=True)
xTrain_scaled = scaler.fit_transform(x_train)
minDis = KNeighborsClassifier(n_neighbors=7)
minDis.fit(xTrain_scaled, y_train)
xTest_scaled = scaler.transform(x_test)

minDis.score(xTest_scaled, y_test)
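
The split above is random, so the accuracy changes from run to run. One possible way to make the three steps reproducible is sketched below; `random_state=0`, `stratify=y`, and the name `knn_pipeline` are assumptions added for illustration and are not required by the exercise.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

x, y = load_breast_cancer(return_X_y=True, as_frame=True)

# random_state and stratify are assumptions that fix the split and keep class ratios.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.15, random_state=0, stratify=y
)

# The pipeline fits the scaler on the training data only and reuses it at test time.
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7))
knn_pipeline.fit(x_train, y_train)

accuracy = knn_pipeline.score(x_test, y_test)
print(f"test accuracy: {accuracy:.4f}")
```

Bundling the scaler and the classifier in one pipeline avoids accidentally fitting the scaler on the test data.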

# Confusion Matrix

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(minDis, xTest_scaled, y_test, cmap='Blues')
plt.show()

In [None]:
manual_calc_score = (29 + 52)/(29+4+1+52)
manual_calc_score

### Benign data points are represented by which class number?

#### 1

### Malignant data points are represented by which class number?

#### 0
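
The mapping between class numbers and names can be checked against the dataset metadata. A small sketch, assuming the scikit-learn loader used at the top of the notebook:

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)

# target_names is indexed by class number.
for class_number, name in enumerate(data.target_names):
    count = int((data.target == class_number).sum())
    print(class_number, name, count)
```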

### How many data points are correctly classified as benign?

#### from the plot (1,1) --> 52

### How many data points are correctly classified as malignant?

#### from the plot (0,0) --> 29

### How many data points are classified as malignant, although being benign?

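#### from the plot (1,0) --> 1 (rows are true labels and columns are predicted labels, so cell (0,1) --> 4 counts malignant points classified as benign)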
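
In scikit-learn, the rows of the confusion matrix are the true classes and the columns are the predicted classes. The sketch below rebuilds label vectors that reproduce the counts read from the plot (29, 4, 1, 52); these numbers belong to that particular run, and the names `y_true` and `y_hat` are illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Label vectors that reproduce the counts from the plot: (true, predicted) pairs
# (0,0) x 29, (0,1) x 4, (1,0) x 1, (1,1) x 52.
y_true = np.repeat([0, 0, 1, 1], [29, 4, 1, 52])
y_hat = np.repeat([0, 1, 0, 1], [29, 4, 1, 52])

cm = confusion_matrix(y_true, y_hat)  # rows: true class, columns: predicted class
print(cm)

malignant_as_benign = cm[0, 1]  # true 0 (malignant), predicted 1 (benign)
benign_as_malignant = cm[1, 0]  # true 1 (benign), predicted 0 (malignant)
accuracy = np.trace(cm) / cm.sum()  # correct predictions lie on the diagonal
print(malignant_as_benign, benign_as_malignant, round(float(accuracy), 4))
```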

# Performance Measures

#### Given the output from the confusion matrix, compute precision, recall and F1 score by hand for both labels.

In [None]:
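# Class 0 (malignant) is treated as the positive class.
# Rows of the confusion matrix are true labels, columns are predicted labels.
TP = 29  # malignant predicted as malignant, cell (0,0)
FN = 4   # malignant predicted as benign, cell (0,1)
FP = 1   # benign predicted as malignant, cell (1,0)
TN = 52  # benign predicted as benign, cell (1,1)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
F1_score = 2 * (precision * recall) / (precision + recall)
print("precision=", precision)
print("recall=", recall)
print("F1-score=", F1_score)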
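
The task asks for the scores of both labels. The following is a minimal sketch that computes them from the counts read off the confusion matrix plot (29, 4, 1, 52) and cross-checks the result with scikit-learn; `scores_for` is a hypothetical helper written for this illustration, and the label vectors are reconstructed from the counts rather than taken from the model.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

cm = np.array([[29, 4], [1, 52]])  # rows: true class, columns: predicted class


def scores_for(cm, positive):
    """Precision, recall and F1 with class `positive` treated as the positive label."""
    tp = cm[positive, positive]
    fp = cm[:, positive].sum() - tp  # predicted positive, actually the other class
    fn = cm[positive, :].sum() - tp  # actually positive, predicted the other class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


by_hand = {label: scores_for(cm, label) for label in (0, 1)}

# Cross-check on label vectors that reproduce the same confusion matrix.
y_true = np.repeat([0, 0, 1, 1], cm.ravel())
y_hat = np.repeat([0, 1, 0, 1], cm.ravel())
prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_hat)

for label in (0, 1):
    print(label, [round(float(v), 4) for v in by_hand[label]])
```

Swapping which class counts as positive swaps the roles of the off-diagonal cells, which is why the two labels get different precision and recall.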

#### Only then, use the methods implemented in scikit-learn for verification. Again, make sure to compute the values for both classes.

# Classification Report

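#### The values calculated by hand only match the report when TP, TN, FP, and FN are read with the convention scikit-learn uses: rows of the confusion matrix are true labels, columns are predicted labels, and each class is treated in turn as the positive class.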

#### The classification report provides a very good summary of all scores. Compare it with your own computed values to make sure you can read and interpret the report.

In [None]:
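from sklearn.metrics import classification_report

# Predict the test labels first; y_pred is not defined anywhere earlier
y_pred = minDis.predict(xTest_scaled)
print(classification_report(y_test, y_pred, digits=4))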