## Import Libraries

In [1]:
import pandas as pd

## Data

In [2]:
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         5

All our features are numerical (`float64`), so we don't need to convert any of them. There are also no null values: the frame has 569 entries and every feature reports 569 non-null values, so no imputation is needed. This is a pretty clean data set!
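The same two checks can be done programmatically instead of reading the `df.info()` listing by eye. The sketch below rebuilds the frame as in the cell above; the names `n_missing` and `non_numeric` are introduced here for illustration only.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Rebuild the same DataFrame as in the notebook cell above
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

# Total number of missing cells across the whole frame
n_missing = int(df.isna().sum().sum())

# Columns whose dtype is not numeric (these would need encoding)
non_numeric = df.select_dtypes(exclude="number").columns.tolist()

print("missing values:", n_missing)
print("non-numeric columns:", non_numeric)
```

If either check turned up something, that would be the point to add imputation or encoding before modelling.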

In [3]:
df['target'].value_counts()

1    357
0    212
Name: target, dtype: int64

The classes are somewhat imbalanced (357 samples of class 1 versus 212 of class 0, roughly 63% to 37%), but for this exercise, it's okay.
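To put a number on the imbalance, the class shares can be worked out directly from the counts printed above. The sketch also derives the accuracy of a trivial model that always predicts the larger class, which is a useful floor when reading the accuracy reported later. The variable names are illustrative.

```python
# Class counts as printed by value_counts() above
counts = {1: 357, 0: 212}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A model that always predicts the most frequent class
# would reach this accuracy without learning anything
majority_baseline = max(shares.values())

for label, share in shares.items():
    print(f"class {label}: {share:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```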


## Split Data

In [4]:
# assign inputs and output

X = df.drop('target', axis=1)
y = df['target']

In [5]:
# randomly split data into train and test dataframes.
# 30% of the data will be in the test dataframe and 70% will go into the train set.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)
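Because the classes are not evenly balanced, a purely random split can leave the train and test sets with slightly different class ratios. One option, not used in the run above, is to pass `stratify=y` so both splits keep roughly the same ratio. This is a minimal sketch of that variant; it rebuilds `X` and `y` so the block runs on its own.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="target")

# stratify=y preserves the class proportions of y in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=101, stratify=y
)

print("train rows:", len(X_train), "test rows:", len(X_test))
print("share of class 1 in train:", round(y_train.mean(), 3))
print("share of class 1 in test: ", round(y_test.mean(), 3))
```

Note that a stratified split selects different rows than the unstratified one, so metrics computed on it are not directly comparable to the numbers reported in this notebook.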

## Train & Predict

In [6]:
from sklearn.linear_model import LogisticRegression

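# max_iter is raised from the default of 100 so the solver has room
# to converge on these unscaled features
model = LogisticRegression(max_iter=3000)
model.fit(X_train, y_train)
# predicted class labels (0 or 1) for the held-out test set
predictions = model.predict(X_test)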
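The large `max_iter` is needed mainly because the features are on very different scales (compare `mean area` with `mean smoothness`). A common alternative, shown here only as a sketch and not as what this notebook ran, is to standardize the features inside a pipeline so the default iteration budget is usually enough. The name `scaled_model` is introduced for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=101
)

# The scaler is fit on the training data only; the pipeline then
# applies the same transformation to the test data at predict time
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(X_train, y_train)

test_accuracy = scaled_model.score(X_test, y_test)
n_iter = scaled_model.named_steps["logisticregression"].n_iter_

print("test accuracy:", test_accuracy)
print("solver iterations used:", n_iter)
```

Putting the scaler in the pipeline, rather than scaling the whole data set up front, keeps test-set statistics from leaking into training.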

## Evaluation Metrics

In [7]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# scikit-learn metrics expect the true labels first, then the predictions
print('accuracy on test set: ', accuracy_score(y_test, predictions))
print('\nconfusion matrix:\n\n', confusion_matrix(y_test, predictions))
print('\nclassification report\n\n', classification_report(y_test, predictions))


accuracy on test set:  0.9415204678362573

confusion matrix:

 [[ 59   7]
 [  3 102]]

classification report

               precision    recall  f1-score   support

           0       0.95      0.89      0.92        66
           1       0.94      0.97      0.95       105

    accuracy                           0.94       171
   macro avg       0.94      0.93      0.94       171
weighted avg       0.94      0.94      0.94       171
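The report's numbers can be rebuilt by hand from the four confusion-matrix counts, which also shows why argument order matters. The sketch below arranges the counts from this run in scikit-learn's `confusion_matrix(y_true, y_pred)` convention (rows are true labels, columns are predicted labels) and then transposes the matrix to show that swapping the two arguments exchanges precision and recall. The helper `per_class_metrics` is hypothetical and written only for this illustration.

```python
# Counts from the test-set run (171 samples):
# rows = true label, columns = predicted label
cm = [[59, 7],
      [3, 102]]

def per_class_metrics(matrix, label):
    """Return (precision, recall, f1) for one class of a square matrix."""
    tp = matrix[label][label]
    predicted_as_label = sum(row[label] for row in matrix)  # column sum
    actually_label = sum(matrix[label])                     # row sum
    precision = tp / predicted_as_label
    recall = tp / actually_label
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total
print(f"accuracy: {accuracy:.4f}")

for label in (0, 1):
    p, r, f1 = per_class_metrics(cm, label)
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")

# Passing (y_pred, y_true) instead transposes the matrix,
# which swaps precision and recall for every class
cm_swapped = [list(col) for col in zip(*cm)]
p1, r1, _ = per_class_metrics(cm, 1)
p1_swapped, r1_swapped, _ = per_class_metrics(cm_swapped, 1)
print("swap exchanges precision and recall:",
      p1 == r1_swapped and r1 == p1_swapped)
```

Accuracy and F1 are unaffected by the swap, which is why a transposed matrix can go unnoticed when only those two numbers are checked.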

