# This notebook explains logistic regression and how it is used for binary classification.

## For this task we use the same dataset as in the KNN notebook: Breast Cancer prediction. The target records whether the cancer is malignant or benign, which makes this a binary classification problem (1 or 0).

### We chose this dataset because it needs little preprocessing, which lets us focus on logistic regression itself. We use our logit model to predict whether the cancer is benign or malignant.
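
As a rough illustration of what a logit model does, the sketch below passes a linear score through the sigmoid function to get a probability, then thresholds it at 0.5 to pick class 1 or 0. The weights, bias and sample values are made up for illustration and are not fitted to the breast cancer data.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias for two features (illustrative values only).
weights = np.array([0.8, -0.5])
bias = -0.2

# Two hypothetical samples, one per row.
samples = np.array([[2.0, 1.0],
                    [-1.0, 3.0]])

scores = samples @ weights + bias             # linear part: w.x + b
probabilities = sigmoid(scores)               # squashed into (0, 1)
labels = (probabilities >= 0.5).astype(int)   # threshold at 0.5

print(probabilities)
print(labels)
```

Training a logistic regression amounts to choosing the weights and bias so that these probabilities agree with the observed labels as closely as possible.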

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.
The linear program used to obtain the separating plane in the 3-dimensional space is that described in: [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].

## Attribute Information:

1) ID number

2) Diagnosis (M = malignant, B = benign)

Ten real-valued features are computed for each cell nucleus:

a) radius (mean of distances from center to points on the perimeter)

b) texture (standard deviation of gray-scale values)

c) perimeter

d) area

e) smoothness (local variation in radius lengths)

f) compactness (perimeter^2 / area - 1.0)

g) concavity (severity of concave portions of the contour)

h) concave points (number of concave portions of the contour)

i) symmetry

j) fractal dimension ("coastline approximation" - 1)

The mean, standard error and "worst" or largest (mean of the three
largest values) of these features were computed for each image,
resulting in 30 features. For instance, field 3 is Mean Radius, field
13 is Radius SE, field 23 is Worst Radius.

All feature values are recoded with four significant digits.

Missing attribute values: none

Class distribution: 357 benign, 212 malignant
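
The shape and class counts quoted above can be checked quickly. The sketch below uses `load_breast_cancer` from scikit-learn, on the assumption that this bundled copy is the same Wisconsin diagnostic dataset as the CSV used in this notebook; the CSV itself is not read here.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

bunch = load_breast_cancer()

# In scikit-learn's copy the target is numeric: 0 = malignant, 1 = benign.
counts = np.bincount(bunch.target)
class_counts = {str(name): int(n) for name, n in zip(bunch.target_names, counts)}

print(bunch.data.shape)   # (rows, features)
print(class_counts)
```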


In [1]:
# Importing the relevant libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.model_selection import train_test_split

In [2]:
# Loading our dataset: 
data =  pd.read_csv(r'Kaggle_BreastCancer.csv')

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
id                         569 non-null int64
diagnosis                  569 non-null object
radius_mean                569 non-null float64
texture_mean               569 non-null float64
perimeter_mean             569 non-null float64
area_mean                  569 non-null float64
smoothness_mean            569 non-null float64
compactness_mean           569 non-null float64
concavity_mean             569 non-null float64
concave points_mean        569 non-null float64
symmetry_mean              569 non-null float64
fractal_dimension_mean     569 non-null float64
radius_se                  569 non-null float64
texture_se                 569 non-null float64
perimeter_se               569 non-null float64
area_se                    569 non-null float64
smoothness_se              569 non-null float64
compactness_se             569 non-null float64
concavity_se               569 non

In [3]:
data.head(5)

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


In [4]:
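# Dropping the 'id' and 'Unnamed: 32' columns: 'id' is only a record identifier
# and 'Unnamed: 32' is empty in the rows shown above, so neither helps prediction.

# Keep an untouched copy of the original frame.
data1 = data.copy()

data = data.drop(['id', 'Unnamed: 32'], axis=1)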

In [5]:
# Defining our variables of interest:
# diagnosis is the target variable that we have to predict, and
# the remaining columns are our features (the independent variables); there are no null values among them.
Y = data['diagnosis']

X = data.drop('diagnosis', axis=1)

In [6]:
X.shape , Y.shape

((569, 30), (569,))
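
The introduction describes the task as 1-or-0 classification, while `Y` keeps the original 'M'/'B' strings (scikit-learn accepts string labels as they are). If a numeric target is wanted, one possible mapping is sketched below on a small hypothetical Series; choosing 'M' as 1 is an assumption made here for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the diagnosis column.
diagnosis = pd.Series(['M', 'B', 'B', 'M', 'B'], name='diagnosis')

# Assumed encoding: malignant -> 1, benign -> 0.
y_numeric = diagnosis.map({'M': 1, 'B': 0})

print(y_numeric.tolist())
```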

In [7]:
from sklearn.linear_model import LogisticRegression

In [8]:
log_reg = LogisticRegression()

In [9]:
help(LogisticRegression)

Help on class LogisticRegression in module sklearn.linear_model._logistic:

class LogisticRegression(sklearn.base.BaseEstimator, sklearn.linear_model._base.LinearClassifierMixin, sklearn.linear_model._base.SparseCoefMixin)
 |  LogisticRegression(penalty='l2', *, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)
 |  
 |  Logistic Regression (aka logit, MaxEnt) classifier.
 |  
 |  In the multiclass case, the training algorithm uses the one-vs-rest (OvR)
 |  scheme if the 'multi_class' option is set to 'ovr', and uses the
 |  cross-entropy loss if the 'multi_class' option is set to 'multinomial'.
 |  (Currently the 'multinomial' option is supported only by the 'lbfgs',
 |  'sag', 'saga' and 'newton-cg' solvers.)
 |  
 |  This class implements regularized logistic regression using the
 |  'liblinear' library, 'newton-cg', 's

In [10]:
from sklearn.model_selection import train_test_split

In [11]:
# Splitting the dataset into a train set and a test set.
# Note: test_size=0.02 holds out only 2% of the rows (12 of 569), as the shapes below confirm; an 80:20 split would need test_size=0.2.

# We train our model on the train set and use the test set to check how well its predictions hold up on unseen data.

# Fixing random_state seeds the shuffle, so every run of the notebook produces the same train and test sets.
# This makes the results reproducible.

xtrain, xtest, ytrain, ytest = train_test_split(X, Y, random_state=204, test_size=0.02)

In [12]:
xtrain.shape , ytrain.shape , xtest.shape, ytest.shape

((557, 30), (557,), (12, 30), (12,))
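
The effect of `random_state` can be seen on a small synthetic example: calling `train_test_split` twice with the same seed returns identical splits. The sketch also passes `stratify`, an optional argument not used in this notebook, which keeps the class proportions the same in both parts. The data here is randomly generated stand-in data, not the breast cancer set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features, 30% positive class.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 30 + [0] * 70)

# Same seed twice -> the same rows land in train and test both times.
split_a = train_test_split(X_demo, y_demo, test_size=0.2,
                           random_state=204, stratify=y_demo)
split_b = train_test_split(X_demo, y_demo, test_size=0.2,
                           random_state=204, stratify=y_demo)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

x_tr, x_te, y_tr, y_te = split_a
print(same_split)
print(x_tr.shape, x_te.shape)
print(y_te.mean())   # share of positives in the test part
```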

## Building the Logistic Model.

In [13]:
# As explained earlier, we will use logistic regression as our model and see how well it fits the data.

# We also evaluate our model's precision and performance.

log_reg = LogisticRegression(solver='saga', penalty='l1')

# Coming back to logistic regression:

# it has several hyperparameters such as 'penalty', 'solver', 'max_iter', 'verbose', 'fit_intercept', etc.
# Here we use an L1 penalty with the 'saga' solver, which supports it.

# Fitting our model:

log_reg.fit(xtrain, ytrain)



LogisticRegression(penalty='l1', solver='saga')
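
The scikit-learn documentation notes that the 'saga' solver is only guaranteed to converge quickly when the features are on roughly the same scale, and the columns here range from fractions to thousands. One possible variation, not what the cell above does, is to standardise the features inside a pipeline. The sketch assumes scikit-learn's bundled `load_breast_cancer` copy stands in for the CSV, and `max_iter=5000` and the 80:20 split are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled copy of the dataset (assumed equivalent to the CSV used above).
X_demo, y_demo = load_breast_cancer(return_X_y=True)
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=204, stratify=y_demo)

# The scaler is fitted on the training rows only, then applied to the test rows.
scaled_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver='saga', penalty='l1', max_iter=5000),
)
scaled_model.fit(x_tr, y_tr)

test_accuracy = scaled_model.score(x_te, y_te)
print(round(test_accuracy, 3))
```

Putting the scaler inside the pipeline keeps test-set statistics out of the training step.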

In [14]:
xtrain

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
506,12.220,20.04,79.47,453.1,0.10960,0.11520,0.08175,0.02166,0.2124,0.06894,...,13.160,24.17,85.13,515.3,0.1402,0.2315,0.3535,0.08088,0.2709,0.08839
268,12.870,16.21,82.38,512.2,0.09425,0.06219,0.03900,0.01615,0.2010,0.05769,...,13.900,23.64,89.27,597.5,0.1256,0.1808,0.1992,0.05780,0.3604,0.07062
435,13.980,19.62,91.12,599.5,0.10600,0.11330,0.11260,0.06463,0.1669,0.06544,...,17.040,30.80,113.90,869.3,0.1613,0.3568,0.4069,0.18270,0.3179,0.10550
132,16.160,21.54,106.20,809.8,0.10080,0.12840,0.10430,0.05613,0.2160,0.05891,...,19.470,31.68,129.70,1175.0,0.1395,0.3055,0.2992,0.13120,0.3480,0.07619
324,12.200,15.21,78.01,457.9,0.08673,0.06545,0.01994,0.01692,0.1638,0.06129,...,13.750,21.38,91.11,583.1,0.1256,0.1928,0.1167,0.05556,0.2661,0.07961
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
514,15.050,19.07,97.26,701.9,0.09215,0.08597,0.07486,0.04335,0.1561,0.05915,...,17.580,28.06,113.80,967.0,0.1246,0.2101,0.2866,0.11200,0.2282,0.06954
538,7.729,25.49,47.98,178.8,0.08098,0.04878,0.00000,0.00000,0.1870,0.07285,...,9.077,30.92,57.17,248.0,0.1256,0.0834,0.0000,0.00000,0.3058,0.09938
269,10.710,20.39,69.50,344.9,0.10820,0.12890,0.08448,0.02867,0.1668,0.06862,...,11.690,25.21,76.51,410.4,0.1335,0.2550,0.2534,0.08600,0.2605,0.08701
245,10.480,19.86,66.72,337.7,0.10700,0.05971,0.04831,0.03070,0.1737,0.06440,...,11.480,29.46,73.68,402.8,0.1515,0.1026,0.1181,0.06736,0.2883,0.07748


In [15]:
# Predicting the class using predict method.

pred = log_reg.predict(xtest)

In [16]:
pred

array(['M', 'M', 'B', 'B', 'B', 'B', 'M', 'B', 'B', 'B', 'B', 'B'],
      dtype=object)
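
Behind each label returned by `predict` there is a probability. The sketch below, on a tiny made-up one-feature dataset, shows that `predict_proba` returns one column per class in the order of `classes_`, and that `predict` picks the class with the highest probability.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature data: larger values tend to be 'M'.
x_demo = np.array([[0.5], [1.0], [1.5], [2.0], [3.0], [3.5], [4.0], [4.5]])
y_demo = np.array(['B', 'B', 'B', 'B', 'M', 'M', 'M', 'M'])

clf = LogisticRegression().fit(x_demo, y_demo)

x_new = np.array([[1.2], [3.8]])
proba = clf.predict_proba(x_new)                     # columns follow clf.classes_
labels_from_proba = clf.classes_[proba.argmax(axis=1)]

print(clf.classes_)
print(proba.round(3))
print(clf.predict(x_new))
```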

In [17]:
# sklearn.metrics provides tools for model evaluation: classification_report summarises precision, recall and F1 per class (which reflects how many wrong predictions were made), and accuracy_score gives the fraction of correct predictions.

from sklearn.metrics import classification_report,accuracy_score

In [18]:
pred = log_reg.predict(xtest)

report = classification_report(ytest, pred)

print(report)

              precision    recall  f1-score   support

           B       0.89      1.00      0.94         8
           M       1.00      0.75      0.86         4

    accuracy                           0.92        12
   macro avg       0.94      0.88      0.90        12
weighted avg       0.93      0.92      0.91        12



# Exploring the classification report:
A classification report measures the quality of the predictions made by a classification algorithm: how many predictions are correct and how many are wrong. More specifically, the counts of True Positives, False Positives, True Negatives and False Negatives are used to compute the metrics of the classification report shown above.


**In predictive analytics, when deciding between two models it is important to pick a single performance metric. As you can see here, there are many that you can choose from (accuracy, recall, precision, f1-score, AUC, etc). Ultimately, you should use the performance metric that is most suitable for the business problem at hand.** 
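
If a single metric is chosen, scikit-learn exposes each one as its own function. The sketch below computes precision, recall and F1 for the malignant class, using hypothetical label lists arranged to give the same counts as the report above (8 benign cases all predicted 'B'; 4 malignant cases, one of them predicted 'B'). The real order of `ytest` is not known, so these lists are illustrative only.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels with the same per-class counts as the report above.
y_true = ['B'] * 8 + ['M'] * 4
y_pred = ['B'] * 8 + ['B'] + ['M'] * 3

# pos_label tells each metric which class counts as "positive".
precision_m = precision_score(y_true, y_pred, pos_label='M')
recall_m = recall_score(y_true, y_pred, pos_label='M')
f1_m = f1_score(y_true, y_pred, pos_label='M')

print(precision_m, recall_m, round(f1_m, 2))
```

For a screening problem like this one, recall on the malignant class is often the number to watch, since a missed malignant case is usually the costlier error.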

In [28]:
# Creating the confusion matrix: 

from sklearn.metrics import confusion_matrix, accuracy_score

# confusion_matrix counts how often each true class was predicted as each class;
# accuracy_score gives the fraction of predictions that hit the right class, in our case the cancer type.
cm = confusion_matrix(ytest, pred)

print(cm)


[[8 0]
 [1 3]]


# The confusion matrix above is made up of these four elements:

* True Positives : The cases in which we predicted YES and the actual output was also YES.
* True Negatives : The cases in which we predicted NO and the actual output was NO.
* False Positives : The cases in which we predicted YES and the actual output was NO.
* False Negatives : The cases in which we predicted NO and the actual output was YES.
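
To connect the four definitions to the printed matrix: scikit-learn puts the true labels on the rows and the predictions on the columns, in sorted label order, here 'B' then 'M'. The sketch below copies the printed matrix into an array and treats malignant ('M') as YES, which is an interpretation chosen here for illustration.

```python
import numpy as np

# Confusion matrix printed above: rows = true labels, columns = predictions,
# both in sorted label order ['B', 'M'].
cm_demo = np.array([[8, 0],
                    [1, 3]])

# With 'M' (malignant) as the positive class, ravel() yields TN, FP, FN, TP.
tn, fp, fn, tp = (int(v) for v in cm_demo.ravel())

accuracy = (tp + tn) / cm_demo.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(tn, fp, fn, tp)
print(accuracy, precision, recall)
```

The single False Negative is a malignant case predicted as benign, which is why recall for 'M' in the classification report is 0.75 while its precision is 1.00.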

In [29]:

print("Accuracy score:",accuracy_score(ytest,pred))

Accuracy score: 0.9166666666666666
