In this small project we use logistic regression to predict breast cancer with the dataset provided in sklearn.

In [0]:
# Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
# the next libraries are for metrics of the model
from sklearn.metrics import precision_score, accuracy_score, recall_score, f1_score, roc_auc_score

In [2]:
# Importing the dataset
dataset = datasets.load_breast_cancer()
dataset.keys()

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

In [0]:
# A brief description of the dataset
print(dataset.DESCR)

In [0]:
# In this case, all the information about the tumor is stored in X
# y is 0 if the tumor is malignant and 1 if it is benign (see dataset.target_names)
X = dataset.data
y = dataset.target
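
The label encoding can be checked directly from the dataset instead of being taken on trust. A minimal sketch that uses only `target_names` and `np.bincount`; the variable name `class_counts` is introduced here for illustration:

```python
import numpy as np
from sklearn import datasets

dataset = datasets.load_breast_cancer()

# target_names[i] is the name of the class encoded as i in dataset.target
for label, name in enumerate(dataset.target_names):
    print(label, name)

# Number of observations per class, indexed by label
class_counts = np.bincount(dataset.target)
print(class_counts)
```

The classes are not perfectly balanced, which is worth keeping in mind when reading accuracy later on.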

In [0]:
# Dividing the data into train and test
x_train, x_test, y_train, y_test = train_test_split(X,y, test_size =0.2)

In [0]:
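# The next step is to scale the features to a comparable magnitude,
# because they are measured in different units and ranges
# (for example, area values are in the hundreds while smoothness values are around 0.1).
# We call fit_transform only on the training data, so the mean and standard
# deviation are learned from the training set alone. The test data is then
# transformed with those same statistics; fitting the scaler on the test data
# as well would leak test information into the model.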

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
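
One way to make the "fit on train, transform on test" rule hard to break is to put the scaler and the model into a single scikit-learn pipeline, which fits the scaler only on the data passed to `fit`. This is a sketch of an alternative, not the approach used in this notebook; `random_state=0` and `stratify=y` are assumptions added here for reproducibility.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# The scaler is fitted on x_train only; x_test is just transformed
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(x_train, y_train)

# score() scales x_test with the training statistics, then returns accuracy
test_accuracy = pipe.score(x_test, y_test)
print(test_accuracy)
```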

In [0]:
# Calling the model
log = LogisticRegression()

In [18]:
# Training the model
log.fit(x_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [0]:
# Making predictions with the test data
y_pred = log.predict(x_test)

In [23]:
# Now it's time to see how many predictions the model got right
# To do that we will use a confusion matrix (rows: true class, columns: predicted class)
matrix = confusion_matrix(y_test, y_pred)
print(matrix)

[[33  4]
 [ 0 77]]




1.   33 observations were predicted as malignant (class 0) and are indeed malignant.
2.   77 observations were predicted as benign (class 1) and are indeed benign.
3.   4 observations were predicted as benign but are actually malignant.
4.   0 observations were predicted as malignant but are actually benign.
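
Reading the matrix is easier with the rows and columns labelled. A small sketch that wraps the matrix printed above in a pandas DataFrame; the counts are copied from that output, and the label strings are chosen here for illustration:

```python
import numpy as np
import pandas as pd

# Counts copied from the confusion matrix printed above
matrix = np.array([[33, 4],
                   [0, 77]])

# Class 0 is malignant and class 1 is benign in this dataset
class_names = ["malignant", "benign"]

labelled = pd.DataFrame(
    matrix,
    index=[f"true {name}" for name in class_names],
    columns=[f"predicted {name}" for name in class_names],
)
print(labelled)
```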


In [55]:
# The intercept of our model
log.intercept_

array([0.00652354])

In [54]:
# The coefficients of our model
log.coef_

array([[-0.31298084, -0.48109077, -0.32639824, -0.38086679, -0.31257755,
         0.41326128, -1.02756373, -1.01200474,  0.23797808,  0.298702  ,
        -1.19555254,  0.20383072, -0.85581427, -0.86757443, -0.22583281,
         1.20896325, -0.03557877, -0.45911753,  0.3193126 ,  0.61196531,
        -0.9795635 , -0.90350727, -0.92216016, -0.92579619, -0.65991489,
         0.18123455, -0.91688204, -0.76679806, -0.8348285 , -0.48503981]])

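$$P(y = 1 \mid x) = \frac{1}{1+e^{-(b + w^{T}x)}} $$

Where b = 0.00652354 is the intercept, w is the coefficient vector shown above, and x is the vector of scaled features of one observation. The result is the estimated probability that the tumor belongs to class 1 (benign).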
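
The logistic function can be checked against scikit-learn by computing the probability by hand from `intercept_` and `coef_`. This is a minimal sketch that, for brevity, scales and fits on the whole dataset without a train/test split, so its coefficients will differ from the ones printed above.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression().fit(X_scaled, y)

# Linear score b + w^T x for every observation
z = X_scaled @ model.coef_.ravel() + model.intercept_[0]

# Logistic (sigmoid) function turns the score into P(y = 1 | x)
manual_proba = 1.0 / (1.0 + np.exp(-z))

# Column 1 of predict_proba is the probability of class 1
sklearn_proba = model.predict_proba(X_scaled)[:, 1]
print(np.allclose(manual_proba, sklearn_proba))
```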

In [29]:
# Evaluation metrics (by default the positive class is 1, i.e. benign)
precision = precision_score(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
precision, accuracy, recall, f1

(0.9506172839506173, 0.9649122807017544, 1.0, 0.9746835443037974)
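
These numbers can be reproduced by hand from the confusion matrix printed earlier, which also makes it clear that they treat benign (class 1) as the positive class. The sketch below additionally computes the recall for the malignant class, a quantity the notebook does not report; the name `malignant_recall` is introduced here.

```python
import numpy as np

# Counts copied from the confusion matrix printed earlier
matrix = np.array([[33, 4],
                   [0, 77]])

# With class 1 (benign) as positive, ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = matrix.ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / matrix.sum()
f1 = 2 * precision * recall / (precision + recall)

# Share of truly malignant tumors that the model labelled as malignant
malignant_recall = tn / (tn + fp)

print(precision, accuracy, recall, f1, malignant_recall)
```

For a screening task the malignant recall is arguably the more important figure, since a missed malignant tumor costs more than a false alarm.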

In [31]:
# ROC AUC score (computed here from the hard class predictions)
roc_auc = roc_auc_score(y_test, y_pred)
roc_auc

0.9459459459459459
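
ROC AUC is usually computed from continuous scores rather than from 0/1 predictions, because hard labels reduce the ROC curve to a single threshold. A hedged sketch using `predict_proba`; it retrains the model with `random_state=0` (an assumption added here), so its values will not match the output above.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
model = LogisticRegression().fit(x_train, y_train)

# Probability of class 1 (benign) for each test observation
y_score = model.predict_proba(x_test)[:, 1]

auc_from_scores = roc_auc_score(y_test, y_score)
auc_from_labels = roc_auc_score(y_test, model.predict(x_test))
print(auc_from_scores, auc_from_labels)
```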

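Our model is pretty good :) It classifies about 96% of the test observations correctly and finds every benign tumor, although it labels 4 of the 37 malignant tumors as benign, which is the more costly kind of error.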


In [51]:
# Making a simple prediction when all (scaled) x values are ones
x_new = np.ones(30).reshape(1,-1)
y_pred_new = log.predict(x_new)
y_pred_new

array([0])

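Our prediction tells us that a tumor with those values (all ones) is malignant, because class 0 is malignant in this dataset. Since the model was trained on scaled features, a vector of ones describes a tumor that is one standard deviation above the training mean on every feature. Remember that the model predicts 1 when the estimated probability of class 1 is above 0.5, and 0 otherwise.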
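
The `array([0])` result can be reproduced by hand from the intercept and coefficients printed earlier. In this sketch those values are copied from the notebook output; the names `z`, `p_class_1` and `predicted_class` are introduced here for illustration.

```python
import numpy as np

# Intercept and coefficients copied from the outputs printed earlier
intercept = 0.00652354
coef = np.array([
    -0.31298084, -0.48109077, -0.32639824, -0.38086679, -0.31257755,
     0.41326128, -1.02756373, -1.01200474,  0.23797808,  0.298702,
    -1.19555254,  0.20383072, -0.85581427, -0.86757443, -0.22583281,
     1.20896325, -0.03557877, -0.45911753,  0.3193126,   0.61196531,
    -0.9795635,  -0.90350727, -0.92216016, -0.92579619, -0.65991489,
     0.18123455, -0.91688204, -0.76679806, -0.8348285,  -0.48503981,
])

x_new = np.ones(30)

# Linear score: with all features equal to 1 this is intercept + sum(coef)
z = intercept + coef @ x_new

# Estimated probability of class 1 (benign)
p_class_1 = 1.0 / (1.0 + np.exp(-z))

# Class 1 is predicted only when that probability is above 0.5
predicted_class = int(p_class_1 > 0.5)
print(z, p_class_1, predicted_class)
```

Most coefficients are negative, so the score is strongly negative and the probability of class 1 is far below 0.5, which is why the model returns class 0.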