**LEARNING TO APPLY GRID SEARCH CROSS VALIDATION FOR LOGISTIC REGRESSION CLASSIFICATION MODELS**

In this kernel I use a breast cancer dataset to build a logistic regression machine learning model.

To improve the model, I will use grid search cross validation.

Grid search cross validation tries every combination of candidate parameter values and reports the combination with the best cross-validated score. I will then use those parameters to improve my logistic regression model.
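
As a rough illustration of the cross-validation half of the method, the sketch below splits sample indices into k folds so that each sample is used for validation exactly once. The function name `k_fold_indices` and the sizes (20 samples, 5 folds) are made up for illustration; scikit-learn's own splitters also handle shuffling and stratification, which this sketch leaves out.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for plain, unshuffled k-fold CV."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


# Hypothetical sizes, chosen only to keep the printout short.
folds = list(k_fold_indices(n_samples=20, k=5))
for fold_number, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"fold {fold_number}: train on {len(train_idx)} samples, validate on {val_idx}")
```

A model is trained once per fold and the validation scores are averaged, which gives a steadier estimate than a single train/validation split.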


In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
print(os.listdir("../input"))

# import warnings filter
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)

['Breast_cancer_data.csv']


**Analysis of the Dataset:**

In [2]:
data = pd.read_csv('../input/Breast_cancer_data.csv')
data.head()

|   | mean_radius | mean_texture | mean_perimeter | mean_area | mean_smoothness | diagnosis |
|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0 |


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 6 columns):
mean_radius        569 non-null float64
mean_texture       569 non-null float64
mean_perimeter     569 non-null float64
mean_area          569 non-null float64
mean_smoothness    569 non-null float64
diagnosis          569 non-null int64
dtypes: float64(5), int64(1)
memory usage: 26.8 KB


Every column has 569 non-null entries, so there are no NaN values in this dataset and I don't need to clean it before use.
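
`data.info()` reports non-null counts; a more direct check is to count the missing values per column. The sketch below uses a tiny hand-made frame (the first three rows printed by `data.head()`) as a stand-in for the real CSV, so it runs without the file. On the real data the same two lines would be applied to `data`.

```python
import pandas as pd

# Stand-in for the real dataset: the first three rows printed by data.head().
sample = pd.DataFrame({
    "mean_radius": [17.99, 20.57, 19.69],
    "mean_texture": [10.38, 17.77, 21.25],
    "mean_perimeter": [122.8, 132.9, 130.0],
    "mean_area": [1001.0, 1326.0, 1203.0],
    "mean_smoothness": [0.1184, 0.08474, 0.1096],
    "diagnosis": [0, 0, 0],
})

# Number of missing values in each column, then the grand total.
missing_per_column = sample.isnull().sum()
total_missing = int(missing_per_column.sum())
print(missing_per_column)
print("total missing values:", total_missing)
```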

In [4]:
data.describe()

|   | mean_radius | mean_texture | mean_perimeter | mean_area | mean_smoothness | diagnosis |
|---|---|---|---|---|---|---|
| count | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 |
| mean | 14.127292 | 19.289649 | 91.969033 | 654.889104 | 0.09636 | 0.627417 |
| std | 3.524049 | 4.301036 | 24.298981 | 351.914129 | 0.014064 | 0.483918 |
| min | 6.981 | 9.71 | 43.79 | 143.5 | 0.05263 | 0.0 |
| 25% | 11.7 | 16.17 | 75.17 | 420.3 | 0.08637 | 0.0 |
| 50% | 13.37 | 18.84 | 86.24 | 551.1 | 0.09587 | 1.0 |
| 75% | 15.78 | 21.8 | 104.1 | 782.7 | 0.1053 | 1.0 |
| max | 28.11 | 39.28 | 188.5 | 2501.0 | 0.1634 | 1.0 |


The dataset statistics show that the features span very different scales (mean_area goes up to 2501, while mean_smoothness stays below 0.17), so I will need to normalize the values later.
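
To make the scale difference concrete, here is a small worked example of min-max scaling, using the `min` and `max` values from the `describe()` output above and the first row of the dataset. The helper name `min_max_scale` is hypothetical; it only illustrates the formula (value - min) / (max - min).

```python
def min_max_scale(value, minimum, maximum):
    """Map value from the range [minimum, maximum] onto [0, 1]."""
    return (value - minimum) / (maximum - minimum)


# First row of the dataset, with min/max taken from data.describe().
radius_scaled = min_max_scale(17.99, 6.981, 28.11)
area_scaled = min_max_scale(1001.0, 143.5, 2501.0)
smoothness_scaled = min_max_scale(0.1184, 0.05263, 0.1634)

print(round(radius_scaled, 3), round(area_scaled, 3), round(smoothness_scaled, 3))
```

After scaling, all three features lie between 0 and 1, so no single feature dominates just because of its units.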

In [5]:
data.corr()

|   | mean_radius | mean_texture | mean_perimeter | mean_area | mean_smoothness | diagnosis |
|---|---|---|---|---|---|---|
| mean_radius | 1.0 | 0.323782 | 0.997855 | 0.987357 | 0.170581 | -0.730029 |
| mean_texture | 0.323782 | 1.0 | 0.329533 | 0.321086 | -0.023389 | -0.415185 |
| mean_perimeter | 0.997855 | 0.329533 | 1.0 | 0.986507 | 0.207278 | -0.742636 |
| mean_area | 0.987357 | 0.321086 | 0.986507 | 1.0 | 0.177028 | -0.708984 |
| mean_smoothness | 0.170581 | -0.023389 | 0.207278 | 0.177028 | 1.0 | -0.35856 |
| diagnosis | -0.730029 | -0.415185 | -0.742636 | -0.708984 | -0.35856 | 1.0 |


From the correlation table I can easily see that the 'mean_radius', 'mean_perimeter' and 'mean_area' features are strongly correlated with each other (all pairwise correlations are above 0.98).
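
This pairwise check can be automated. The sketch below rebuilds the feature-to-feature part of the correlation table by hand (so it runs without the CSV) and lists every pair whose absolute correlation exceeds a threshold. The 0.9 cutoff is an arbitrary choice for illustration.

```python
import pandas as pd

features = ["mean_radius", "mean_texture", "mean_perimeter", "mean_area", "mean_smoothness"]

# Feature-to-feature correlations copied from the data.corr() output.
corr = pd.DataFrame(
    [
        [1.0, 0.323782, 0.997855, 0.987357, 0.170581],
        [0.323782, 1.0, 0.329533, 0.321086, -0.023389],
        [0.997855, 0.329533, 1.0, 0.986507, 0.207278],
        [0.987357, 0.321086, 0.986507, 1.0, 0.177028],
        [0.170581, -0.023389, 0.207278, 0.177028, 1.0],
    ],
    index=features,
    columns=features,
)

threshold = 0.9  # illustrative cutoff, not a rule

# Visit each unordered pair once (upper triangle of the matrix).
strong_pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(features)
    for b in features[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]

for a, b, r in strong_pairs:
    print(f"{a} ~ {b}: r = {r}")
```

Strongly correlated features carry largely the same information, which is worth keeping in mind when reading the coefficients of a linear model.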

Let's see how many of the samples are diagnosed as benign (1) or malignant (0):

In [6]:
data['diagnosis'].value_counts()

1    357
0    212
Name: diagnosis, dtype: int64
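
These counts also give a useful baseline: a model that always predicts the more frequent class is already right for every sample of that class. The short sketch below computes that baseline accuracy from the counts printed above, so any real classifier should be judged against it.

```python
# Class counts copied from data['diagnosis'].value_counts().
counts = {1: 357, 0: 212}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

print(f"always predicting {majority_class} gives accuracy {baseline_accuracy:.4f}")
```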

**1. First of all I will separate the diagnosis column from the dataset. The diagnosis values will be my target (y).**

In [7]:
y = data.diagnosis.values
x = data.drop('diagnosis', axis=1)
x.head(3)

|   | mean_radius | mean_texture | mean_perimeter | mean_area | mean_smoothness |
|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 |


**2. Now I will apply min-max normalization to my x values.**

In [8]:
# Min-max scale each column to [0, 1]; DataFrame.min()/max() work column by column.
x = (x - x.min()) / (x.max() - x.min())
x.describe()

|   | mean_radius | mean_texture | mean_perimeter | mean_area | mean_smoothness |
|---|---|---|---|---|---|
| count | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 |
| mean | 0.338222 | 0.323965 | 0.332935 | 0.21692 | 0.394785 |
| std | 0.166787 | 0.145453 | 0.167915 | 0.149274 | 0.126967 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.223342 | 0.218465 | 0.216847 | 0.117413 | 0.304595 |
| 50% | 0.302381 | 0.308759 | 0.293345 | 0.172895 | 0.390358 |
| 75% | 0.416442 | 0.40886 | 0.416765 | 0.271135 | 0.47549 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


**3. Divide the dataset into train and test:**

In [9]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)

In [10]:
print('x_train.shape:', x_train.shape)
print('y_train.shape:', y_train.shape)
print('x_test.shape :', x_test.shape)
print('y_test.shape :', y_test.shape)

x_train.shape: (398, 5)
y_train.shape: (398,)
x_test.shape : (171, 5)
y_test.shape : (171,)
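
One caveat about the order of steps 2 and 3: normalizing before splitting lets the test rows influence the min and max used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below assumes scikit-learn's bundled `load_breast_cancer` data as a stand-in for the CSV (its column names use spaces, e.g. `mean radius`), so the numbers need not match this kernel exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Assumption: scikit-learn's bundled copy of the data stands in for the CSV.
bunch = load_breast_cancer(as_frame=True)
columns = ["mean radius", "mean texture", "mean perimeter", "mean area", "mean smoothness"]
features = bunch.data[columns]
target = bunch.target

# Split first, so the test rows stay unseen during scaling.
x_tr, x_te, y_tr, y_te = train_test_split(features, target, test_size=0.3, random_state=1)

# Learn min and max from the training rows only, then reuse them on the test rows.
scaler = MinMaxScaler()
x_tr_scaled = scaler.fit_transform(x_tr)
x_te_scaled = scaler.transform(x_te)

print(x_tr_scaled.shape, x_te_scaled.shape)
```

With this order the scaled test values may fall slightly outside [0, 1], which is expected: the scaler has never seen those rows.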


In this kernel my aim is to investigate two different scenarios:
*     Scenario_1: Apply logistic regression classification with its default parameters and examine the accuracy of the model.
*     Scenario_2: Apply grid search cross validation and use the parameters it finds in logistic regression classification in order to improve accuracy.




**4. Scenario_1: Applying the Logistic Regression Classification Algorithm Directly:**

In [11]:
from sklearn.linear_model import LogisticRegression

# Creating the model (liblinear was the default solver when this kernel was written):
lr = LogisticRegression(solver='liblinear')

# Training the model with the training data:
lr.fit(x_train, y_train)

print('Scenario_1 score of the logistic regression: ', lr.score(x_test, y_test))

Scenario_1 score of the logistic regression:  0.9005847953216374
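
Accuracy alone does not show which kind of mistake the model makes. A confusion matrix breaks the test predictions down by true and predicted class. The sketch below is an illustration under assumptions: it uses scikit-learn's bundled `load_breast_cancer` data instead of the CSV, fits the scaler on the training rows, and picks `solver='liblinear'`, so its numbers need not match the score above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Assumption: scikit-learn's bundled copy of the data stands in for the CSV.
bunch = load_breast_cancer(as_frame=True)
columns = ["mean radius", "mean texture", "mean perimeter", "mean area", "mean smoothness"]
x_tr, x_te, y_tr, y_te = train_test_split(
    bunch.data[columns], bunch.target, test_size=0.3, random_state=1
)

scaler = MinMaxScaler().fit(x_tr)
model = LogisticRegression(solver="liblinear")
model.fit(scaler.transform(x_tr), y_tr)

predictions = model.predict(scaler.transform(x_te))

# Rows are true classes, columns are predicted classes, both ordered 0 then 1.
matrix = confusion_matrix(y_te, predictions)
accuracy = matrix.trace() / matrix.sum()

print(matrix)
print("accuracy recomputed from the matrix:", accuracy)
```

The diagonal holds the correct predictions, so the accuracy is the diagonal sum divided by the total; the off-diagonal cells show how the errors split between the two classes.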


**5. Scenario_2 Apply Grid Search Cross Validation for Logistic Regression:**

In [12]:
from sklearn.model_selection import GridSearchCV

grid = {'C': np.logspace(-3, 3, 7), 'penalty': ['l1', 'l2']}
# C and penalty are the regularization parameters of logistic regression.
# C is the inverse of the regularization strength: if C is too small the model underfits, if C is too big it can overfit.
# l1 and l2 are the regularization penalty types (l1 = lasso-style, l2 = ridge-style).

# Creating the model (the liblinear solver supports both the l1 and l2 penalties):
lr = LogisticRegression(solver='liblinear')

# Creating the GridSearchCV model from the lr model, the parameter grid and 10-fold cross validation
# (each parameter combination is scored on 10 different train/validation splits):
lr_cv = GridSearchCV(lr, grid, cv=10)

# Training the model:
lr_cv.fit(x_train, y_train)

print('best parameters for logistic regression: ', lr_cv.best_params_)
print('best score for logistic regression after grid search cv:', lr_cv.best_score_)

best parameters for logistic regression:  {'C': 100.0, 'penalty': 'l1'}
best score for logistic regression after grid search cv: 0.9296482412060302
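
To get a feel for how much work the search does, the sketch below enumerates the parameter combinations in plain Python. It writes the seven powers of ten out with a list comprehension instead of `np.logspace`, purely to keep the example dependency-free.

```python
from itertools import product

grid = {
    "C": [10.0 ** exponent for exponent in range(-3, 4)],  # 0.001 up to 1000
    "penalty": ["l1", "l2"],
}

# Every combination of one C value with one penalty type.
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]

n_folds = 10
total_fits = len(candidates) * n_folds

print(len(candidates), "candidates,", total_fits, "cross-validation fits")
```

With its default `refit=True`, `GridSearchCV` then fits the winning combination once more on the whole training set.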


The grid search cross validation found that the best-scoring logistic regression model uses these regularization parameters:
* C = 100
* penalty = l1

In [13]:
# The liblinear solver is needed here because it supports the l1 penalty:
lr_tuned = LogisticRegression(C=100.0, penalty='l1', solver='liblinear')

lr_tuned.fit(x_train, y_train)

print('Scenario_2 (tuned) logistic regression score: ', lr_tuned.score(x_test, y_test))

Scenario_2 (tuned) logistic regression score:  0.9181286549707602
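
Re-creating the model by hand works, but `GridSearchCV` refits the best combination on the whole training set by default (`refit=True`), so the fitted search object can be scored directly. A minimal sketch, assuming scikit-learn's bundled `load_breast_cancer` data as a stand-in for the CSV and `solver='liblinear'` so that the `l1` penalty is supported; its best parameters and scores need not match the ones reported here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler

# Assumption: scikit-learn's bundled copy of the data stands in for the CSV.
bunch = load_breast_cancer(as_frame=True)
columns = ["mean radius", "mean texture", "mean perimeter", "mean area", "mean smoothness"]
x_tr, x_te, y_tr, y_te = train_test_split(
    bunch.data[columns], bunch.target, test_size=0.3, random_state=1
)

scaler = MinMaxScaler().fit(x_tr)
x_tr_scaled = scaler.transform(x_tr)
x_te_scaled = scaler.transform(x_te)

grid = {"C": [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0], "penalty": ["l1", "l2"]}
search = GridSearchCV(LogisticRegression(solver="liblinear"), grid, cv=10)
search.fit(x_tr_scaled, y_tr)

# The refitted winner is available directly; no need to copy parameters by hand.
best_model = search.best_estimator_
test_accuracy = search.score(x_te_scaled, y_te)

print("best parameters:", search.best_params_)
print("test accuracy of the refitted best model:", test_accuracy)
```

Using `best_estimator_` (or `search.score`) avoids typing the chosen values in by hand, which matters once the grid grows.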


**CONCLUSION:**

To improve a model's accuracy, we can first apply grid search cross validation to find the best parameters.

Here, using the regularization parameters it found (C = 100, penalty = l1) raised the test accuracy of the logistic regression classification model from about 0.901 to about 0.918.