# Breast Cancer Wisconsin Data Set

Create a predictive model that classifies benign vs. malignant tumors. 
See https://www.kaggle.com/uciml/breast-cancer-wisconsin-data/data for data understanding.

## Import packages

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

sns.set()

## Load and inspect data set 

In [None]:
original_data = pd.read_csv("data.csv")
original_data.head()

In [None]:
original_data.describe(include="all")  # descriptive statistics for all columns

In [None]:
original_data.isnull().sum()  # check for null values

In [None]:
original_data[original_data.duplicated(keep=False)]  # check for duplicate rows

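If both checks come back empty, there are no missing values and no duplicates, and no cleaning is needed here. Some copies of this CSV include an empty trailing column (for example `Unnamed: 32`); if yours does, drop it before continuing.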

## Inspect features

In [None]:
original_data[["radius_mean", "diagnosis"]].groupby(
    ["diagnosis"], as_index=False
).mean().sort_values(by="diagnosis", ascending=False)

Inspect more features, e.g. texture, perimeter, ...

In [None]:
# your code
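
To compare several features at once, one option is to group by the diagnosis and average a list of columns. The following is only a sketch on a made-up toy frame; `toy`, `features`, `class_means`, and the column name `texture_mean` are assumptions for illustration, so substitute the column names from your own dataframe.

```python
import pandas as pd

# Toy stand-in for the real data; all values are invented for illustration.
toy = pd.DataFrame(
    {
        "diagnosis": ["M", "M", "B", "B", "B"],
        "radius_mean": [18.0, 20.0, 12.0, 11.0, 13.0],
        "texture_mean": [22.0, 24.0, 16.0, 18.0, 17.0],
    }
)

# Mean of each selected feature per diagnosis class, in a single table.
features = ["radius_mean", "texture_mean"]
class_means = toy.groupby("diagnosis")[features].mean()
print(class_means)
```

A feature whose class means differ clearly is a candidate predictor; a feature with nearly identical means for both classes probably carries little information on its own.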

An important step during feature selection is removing features that correlate strongly with each other. You keep one feature as the "representative" of the shared information and remove the redundant ones. There are more advanced methods to do this, but for now, just look at the correlation map and decide which features to keep.

In [None]:
f, ax = plt.subplots(figsize=(18, 18))
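# numeric_only=True leaves out text columns such as "diagnosis" (parameter available in pandas >= 1.5)
sns.heatmap(
    original_data.corr(numeric_only=True), annot=True, linewidths=0.5, fmt=".1f", ax=ax
)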
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.title("Correlation Map")
plt.show()
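
As a complement to reading the heatmap by eye, the strongly correlated pairs can also be listed programmatically. This is a minimal sketch on synthetic data, not part of the original exercise; the frame `toy`, its column names, and the `threshold` of 0.9 are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic data: "perimeter" is built to be almost a multiple of "radius".
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame(
    {
        "radius": base,
        "perimeter": 2 * base + rng.normal(scale=0.01, size=200),
        "texture": rng.normal(size=200),  # unrelated to the other two
    }
)

corr = toy.corr().abs()
# Keep only the upper triangle so that each pair is listed once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # arbitrary cut-off chosen for this sketch
pairs = upper.stack()
high_pairs = pairs[pairs > threshold]
print(high_pairs)
```

For each listed pair you would keep one column and drop the other; which one to keep remains a judgment call.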

## Select predictors

In [None]:
data_reduced_features = original_data[["<your feature 1>", "<your feature 2>", "..."]]

In [None]:
data_reduced_features.head()

Once again, have a look at the correlation map and remove more features if necessary. 

In [None]:
f, ax = plt.subplots(figsize=(18, 18))
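# numeric_only=True leaves out text columns such as the target (parameter available in pandas >= 1.5)
sns.heatmap(
    data_reduced_features.corr(numeric_only=True),
    annot=True,
    linewidths=0.5,
    fmt=".1f",
    ax=ax,
)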
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.title("Correlation Map")
plt.show()

## Prepare for modeling

Set the target (y) and the predictors (X) according to your dataframe:

In [None]:
target = data_reduced_features['<your target column>']
predictors = ...  # your code

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    predictors, target, test_size=0.2, random_state=123
)  # 80-20 split into training and test data

Check if the dataset is balanced.

In [None]:
# your code
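
One way to check the balance is to look at the relative class frequencies. The sketch below uses made-up labels (60 benign, 40 malignant) rather than the real data; the names `y`, `X`, `train_share`, and `test_share` are hypothetical. It also shows the optional `stratify` argument of `train_test_split`, which the split above does not use, as one way to keep the class proportions equal in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels standing in for the diagnosis column (invented counts).
y = pd.Series(["B"] * 60 + ["M"] * 40, name="diagnosis")
X = pd.DataFrame({"feature": range(len(y))})

# Relative frequency of each class in the full data.
print(y.value_counts(normalize=True))

# stratify=y asks for the same class proportions in training and test data.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=123, stratify=y
)
train_share = y_tr.value_counts(normalize=True)
test_share = y_te.value_counts(normalize=True)
print(train_share)
print(test_share)
```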

Use StandardScaler to scale your predictors (fit on training set and transform training and test set):

In [None]:
scaler = ...  # your code
# your code

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
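
The reason for fitting on the training set only is that the test set should not influence the scaling statistics. A small sketch with invented numbers illustrates the pattern; the `*_toy` and `*_scaled` names are hypothetical and chosen so they do not clash with the exercise variables.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented numbers: three training rows, one test row, two features.
X_train_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test_toy = np.array([[2.0, 40.0]])

scaler_toy = StandardScaler()
scaler_toy.fit(X_train_toy)  # mean and std come from the training rows only

# Both sets are transformed with the training statistics.
X_train_scaled = scaler_toy.transform(X_train_toy)
X_test_scaled = scaler_toy.transform(X_test_toy)

print(X_train_scaled.mean(axis=0))  # approximately zero per feature
print(X_test_scaled)  # test values need not be centered
```

After scaling, the training columns have mean 0 and standard deviation 1, while the test row is expressed relative to the training distribution.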

## Classification models and evaluation metrics

Create a decision tree classifier: 

In [None]:
# your code
# your code

print("train performance")
print(classification_report(y_train, tree.predict(X_train)))
print("test performance")
print(classification_report(y_test, tree.predict(X_test)))

How do you evaluate this result? Hints: is the model overfitting or underfitting, and which metric might be most important for this use case, and why?

In [None]:
conf_mat = confusion_matrix(y_test, tree.predict(X_test))
df_cm = pd.DataFrame(
    conf_mat,
    index=["B", "M"],
    columns=["B", "M"],
)
fig = plt.figure(figsize=[10, 7])
heatmap = sns.heatmap(df_cm, annot=True, fmt="d")
heatmap.yaxis.set_ticklabels(
    heatmap.yaxis.get_ticklabels(), rotation=0, ha="right", fontsize=14
)
heatmap.xaxis.set_ticklabels(
    heatmap.xaxis.get_ticklabels(), rotation=45, ha="right", fontsize=14
)
plt.ylabel("True label")
plt.xlabel("Predicted label")
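
To make the question about metrics more concrete, the cells of a confusion matrix can be related to recall and precision for the malignant class. The labels below are invented for illustration and are not model output; treating "M" as the positive class is an assumption that fits a setting where a missed malignant tumor is the costlier error.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Invented labels: 4 malignant and 6 benign cases.
y_true = ["M", "M", "M", "M", "B", "B", "B", "B", "B", "B"]
y_pred = ["M", "M", "M", "B", "B", "B", "B", "B", "B", "M"]

# With labels=["B", "M"], rows are true classes and columns are predictions.
cm = confusion_matrix(y_true, y_pred, labels=["B", "M"])
tn, fp, fn, tp = cm.ravel()

# Recall for "M": share of malignant cases that the predictions caught.
recall_m = recall_score(y_true, y_pred, pos_label="M")
# Precision for "M": share of "M" predictions that were truly malignant.
precision_m = precision_score(y_true, y_pred, pos_label="M")

print(cm)
print("recall (M):", recall_m, "precision (M):", precision_m)
```

Here `fn` counts malignant tumors predicted as benign, which is the number that recall for "M" penalizes.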

Create a logistic regression model: 

In [None]:
# your code
# your code
print("train performance")
print(classification_report(y_train, logreg.predict(X_train)))
print("test performance")
print(classification_report(y_test, logreg.predict(X_test)))

How do you evaluate this result? 

In [None]:
conf_mat = confusion_matrix(y_test, logreg.predict(X_test))
df_cm = pd.DataFrame(
    conf_mat,
    index=["B", "M"],
    columns=["B", "M"],
)
fig = plt.figure(figsize=[10, 7])
heatmap = sns.heatmap(df_cm, annot=True, fmt="d")
heatmap.yaxis.set_ticklabels(
    heatmap.yaxis.get_ticklabels(), rotation=0, ha="right", fontsize=14
)
heatmap.xaxis.set_ticklabels(
    heatmap.xaxis.get_ticklabels(), rotation=45, ha="right", fontsize=14
)
plt.ylabel("True label")
plt.xlabel("Predicted label")

Feel free to try out more classifiers (don't forget to import the required packages), change classifier parameters, modify the train/test split, or select other predictors.
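
As one possible starting point for such experiments, the sketch below wraps scaling and a k-nearest-neighbors classifier in a pipeline and scores it with 5-fold cross-validation. To stay self-contained it assumes scikit-learn's bundled copy of the Wisconsin diagnostic data (`load_breast_cancer`) instead of `data.csv`; that copy uses different column names and a numeric target, so it is a stand-in rather than the notebook's dataframe. The choice of classifier, `n_neighbors=5`, and the `balanced_accuracy` metric are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled copy of the data, used here only so the example runs on its own.
X, y = load_breast_cancer(return_X_y=True)

# The pipeline refits the scaler inside each fold, so the held-out
# fold never contributes to the scaling statistics.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

scores = cross_val_score(model, X, y, cv=5, scoring="balanced_accuracy")
print(scores)
print("mean:", scores.mean())
```

Cross-validation gives several scores instead of a single one from one split, which makes comparisons between classifiers less dependent on a lucky or unlucky split.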