# Logistic Regression

**We will use the Breast Cancer Wisconsin dataset available in sklearn to predict whether a tumor is malignant (cancerous) or benign. Because there are only two categories, this classification task is well suited to logistic regression.**

**This dataset consists of 569 instances, each with 30 features and 1 target. There are two classes in the data: 'Malignant' and 'Benign'. 
We will classify each instance into one of these two classes based on its features.**

**To learn more about sklearn datasets, please refer to the following link: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets**

In [1]:
import pandas as pd # to manipulate data using dataframes
import numpy as np  # to perform mathematical operations on arrays

from sklearn.datasets import load_breast_cancer # to load the breast cancer dataset bundled with sklearn

# to visualise the results
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

For each cell nucleus, 10 different parameters are measured:

**Radius:** Mean of the distances from the centre to points on the perimeter.

**Perimeter:** The total distance between the points on the boundary of the nucleus.

**Area:** Area of the cell nucleus.

**Smoothness:** The local variation in radius lengths. It is given by the difference between the length of a radial line and the mean length of the lines around it.

**Compactness:** A combination of perimeter and area, given by (perimeter^2 / area - 1.0).

**Concavity:** The severity of the concave portions of the contour. Smaller chords capture small concavities better, so this feature is affected by chord length.

**Concave points:** Concavity measures the magnitude of contour concavities, while concave points measures the number of concave portions of the contour.

**Symmetry:** The longest chord is taken as the major axis. The length differences between lines perpendicular to the major axis, on either side of it, are then measured. This is known as the symmetry.

**Fractal dimension:** It is a measure of non-linear growth. As the ruler used to measure the perimeter gets longer, the precision decreases and hence the measured perimeter decreases. This data is plotted on a log scale, and the downward slope gives an approximation of the fractal dimension.

**Texture:** Standard deviation of the gray-scale values. This is helpful for finding the variation in intensity.

Higher values of all the shape features imply a more irregular contour, which in turn suggests a malignant cell.

For each parameter the dataset reports the mean, the standard error and the "worst" value (the mean of the three largest values), which gives the 30 features. The worst and error values are included because only a few malignant cells may be present in a given sample, and these values help to correlate better with malignant cells. Surgery depends on the size of the tumour, hence the worst values are necessary.
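The 30 feature columns can be grouped by the statistic they report. The short sketch below assumes the column naming used by `load_breast_cancer` (`mean ...`, `... error`, `worst ...`); the list names are illustrative only.

```python
from sklearn.datasets import load_breast_cancer

feature_names = list(load_breast_cancer().feature_names)

# Group the 30 column names by the statistic they report.
mean_cols = [n for n in feature_names if n.startswith("mean ")]
error_cols = [n for n in feature_names if n.endswith(" error")]
worst_cols = [n for n in feature_names if n.startswith("worst ")]

print(len(mean_cols), len(error_cols), len(worst_cols))
print(mean_cols)
```

Each group should contain one column per parameter listed above.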

The target value is **0 for malignant** and **1 for benign**.
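The label encoding can be checked directly from the dataset object rather than taken on trust. This is a minimal sketch; the variable `counts` is introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()

# target_names is ordered by label: index 0 is the name of class 0, index 1 of class 1.
print(dataset.target_names)

# Count how many instances fall into each class.
counts = np.bincount(dataset.target)
for label, name in enumerate(dataset.target_names):
    print(label, name, counts[label])
```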

In [2]:
# Loading the dataset
dataset=load_breast_cancer()

# converting it into a pandas dataframe
data_df= pd.DataFrame(columns=list(dataset.feature_names), data=dataset.data)
data_df['Target']=list(dataset.target)
data_df.head()

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,Target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0


In [3]:
# checking the details of the dataframe
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         5

**We can see from the above details that our dataframe 'data_df' has 569 instances and 31 columns (30 features and 1 target), and that none of the columns has any missing values. The datatype of each column is also displayed.**
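The absence of missing values can also be confirmed with a single number instead of reading the `info()` listing. The sketch below rebuilds the dataframe so that it runs on its own; `missing_total` is a name introduced here.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()
data_df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
data_df["Target"] = dataset.target

# Sum the per-column missing counts into one total.
missing_total = int(data_df.isna().sum().sum())
print("Missing values:", missing_total)

# Summarise how many columns there are of each dtype.
print(data_df.dtypes.value_counts())
```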


# Features and Target

**We will now separate the features and the target into two variables, x and y respectively.
Note that x holds the independent variables and y is the dependent variable, since its value depends on the combined values of all the features in x.**

In [4]:
# separating features into a variable called x (independent) 
x=data_df[list(dataset.feature_names)]
# displaying the shape of x
print(x.shape)

# separating targets into a variable called y (dependent)
y=data_df['Target']
# displaying the shape of y
print(y.shape)

(569, 30)
(569,)


**Now we will preprocess the data. In this step we scale the features so that they are all on a comparable scale. StandardScaler rescales each feature to zero mean and unit variance, so that a feature with large values (such as area) does not dominate the model just because of its magnitude.**

In [5]:
from sklearn.preprocessing import StandardScaler # to scale the data

# first create an object of StandardScaler class
scale=StandardScaler()
# then transform the data
x_scaled=scale.fit_transform(x)
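To make the scaling step concrete, the sketch below compares `StandardScaler` with the formula (value - column mean) / column standard deviation computed by hand. The names `X`, `X_scaled` and `X_manual` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data

# Scaling with sklearn.
X_scaled = StandardScaler().fit_transform(X)

# The same standardisation by hand: subtract the column mean, divide by the column std.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_scaled, X_manual))
print(X_scaled.mean(axis=0).round(6)[:3])  # column means after scaling
print(X_scaled.std(axis=0).round(6)[:3])   # column standard deviations after scaling
```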

**Now that our data has been scaled to better suit our model, we will split x_scaled and y into train and test sets (80:20 respectively) using train_test_split from sklearn.model_selection. The documentation is available here: 
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html?highlight=train%20test#sklearn.model_selection.train_test_split**


**The train set is used for training the model, and the test set is used for measuring the performance of the model, as this set has not been seen by the model during training.**

In [6]:
from sklearn.model_selection import train_test_split # to split the data into train and test set
x_train,x_test,y_train,y_test=train_test_split(x_scaled,y,test_size=0.2,random_state=0)

In [7]:
# displaying the size of the arrays after splitting
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

(455, 30)
(455,)
(114, 30)
(114,)


**We can observe that the data has now been split: the train set has 455 instances (80%) and the test set has 114 instances (20%) of the original data.**
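A common alternative ordering is to split first and fit the scaler on the train set only, so that no statistics from the test set reach the model. The sketch below is one way to do this with a scikit-learn pipeline. It is an illustration under that assumption, not the procedure used in this notebook, and its score may differ slightly from the one reported later.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Split the raw (unscaled) features first.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The pipeline fits the scaler on X_train only, then reuses it for X_test.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

test_accuracy = pipe.score(X_test, y_test)
print(round(test_accuracy, 4))
```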

**Since we have prepared our data for the model, we are now ready to perform the task of classifying the cells into one of the two classes: Malignant or Benign.**

**We will be using the LogisticRegression class available in scikit-learn. Although the algorithm is called “Logistic Regression”, it is in fact used for classification: it estimates the probability of a class and assigns the label from that probability. This can be confusing at first, but it is worth remembering.**

**Logistic regression is used when the value of the target variable is categorical in nature. It is most commonly used when the data in question has binary output, i.e. when it belongs to one class or another, or is either a 0 or 1.**
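The binary output comes from passing a linear score through the logistic (sigmoid) function and thresholding the resulting probability at 0.5. A minimal sketch with made-up scores follows; the helper `sigmoid` and the example values are illustrative only.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores for four samples.
z = np.array([-4.0, -0.5, 0.5, 4.0])

probabilities = sigmoid(z)
labels = (probabilities > 0.5).astype(int)  # class 1 when the probability exceeds 0.5

print(probabilities.round(3))
print(labels)
```

Negative scores give probabilities below 0.5 (class 0) and positive scores give probabilities above 0.5 (class 1).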

In [8]:
from sklearn.linear_model import LogisticRegression # to perform Logistic Regression

# creating an object of LogisticRegression class
log_reg=LogisticRegression()

# fit the model on training data
log_reg.fit(x_train,y_train)

LogisticRegression()

**Now our Logistic regression model is trained using the train data set. We will now make predictions using this model.**

**When we train the model on the train set, it learns the relationship between x_train and y_train. When making predictions, it applies that same learned relationship to x_test to produce y_pred.**

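**Since we already know the true classes corresponding to x_test (i.e. y_test), we can measure the performance of the model by comparing y_pred with y_test. We will use the mean absolute error and the accuracy (two of many metrics that can be used for this purpose). Because the labels are 0 and 1, the mean absolute error here is simply the fraction of misclassified instances, i.e. 1 - accuracy.**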

In [9]:
# predicting the classes for x_test using the model trained on x_train and y_train
y_pred=log_reg.predict(x_test)
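What `predict` does for a two-class model can be reproduced from the fitted coefficients: compute a linear score, pass it through the sigmoid, and assign class 1 when the score is positive. The sketch below fits on the full scaled dataset purely to keep the example short, which is an assumption of this illustration and not the notebook's train/test procedure.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression().fit(X_scaled, y)

# Linear score: weighted sum of the features plus the intercept.
z = X_scaled @ model.coef_.ravel() + model.intercept_[0]

# Probability of class 1 (benign) via the sigmoid function.
p_manual = 1.0 / (1.0 + np.exp(-z))
p_sklearn = model.predict_proba(X_scaled)[:, 1]

print(np.allclose(p_manual, p_sklearn))
print(np.array_equal((z > 0).astype(int), model.predict(X_scaled)))
```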

**We will now evaluate the performance of our model.**

In [15]:
from sklearn.metrics import mean_absolute_error,accuracy_score  # to calculate mean absolute error and accuracy
MAE=mean_absolute_error(y_test,y_pred)
print('Mean absolute error: ',MAE)
Accuracy=accuracy_score(y_test,y_pred)
print("Accuracy of the model is:",round(Accuracy*100,2),'%')

Mean absolute error:  0.03508771929824561
Accuracy of the model is: 96.49 %


**The model has a very low mean absolute error (about 0.035, i.e. 4 of the 114 test instances are misclassified) and an accuracy of about 96.5%, so in most cases it predicts the correct class.**
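For 0/1 labels a confusion matrix gives a more detailed picture than a single error figure, because it shows which class the mistakes fall in. The sketch below uses a handful of made-up labels (not the notebook's data) so that the numbers are easy to check by hand.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, mean_absolute_error

# Made-up true and predicted labels: one class-1 sample is predicted as class 0.
y_true = np.array([0, 1, 1, 0, 1])
y_hat = np.array([0, 1, 0, 0, 1])

mae = mean_absolute_error(y_true, y_hat)
acc = accuracy_score(y_true, y_hat)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat)

print(mae, acc)  # for 0/1 labels, mae equals 1 - acc
print(cm)
```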

**We will now place the actual and predicted values side by side in a dataframe.**

In [11]:
predictions_df=pd.DataFrame(columns=['y_test(Actual)','y_pred(Predicted)'])
predictions_df['y_test(Actual)']=y_test
predictions_df['y_pred(Predicted)']=y_pred
predictions_df

,y_test(Actual),y_pred(Predicted)
512,0,0
457,1,1
439,1,1
298,1,1
37,1,1
...,...,...
213,0,1
519,1,1
432,0,0
516,0,0


**Now our model is ready, and we can predict the class of any new data that has the same format as the features. Let's try to predict the class of an arbitrary data point.**

In [12]:
# creating an instance with arbitrary values of the 30 features
x_new=np.array([1.689e+01, 1.538e+01, 1.208e+02, 2.005e+03, 0.034e-01, 1.976e-01,
       2.881e-01, 1.471e-01, 3.419e-01, 6.571e-02, 1.055e+00, 9.000e-01,
       8.779e+00, 2.004e+02, 6.499e-03, 4.404e-02, 5.003e-02, 1.367e-02,
       3.563e-02, 6.233e-03, 2.008e+01, 1.573e+01, 0.846e+02, 1.919e+03,
       1.622e-01, 6.956e-01, 7.899e-01, 2.554e-01, 4.600e-01, 1.109e-01])

In [13]:
# reshaping the data point to (1, 30), i.e. one sample with 30 features, and scaling it
# with the scaler already fitted on the original features (transform, not fit_transform)
x_new_scaled=scale.transform(x_new.reshape(1,-1))

In [14]:
# Predicting the class of this instance of data using our trained logistic regression model
log_reg.predict(x_new_scaled.reshape(1,30))

array([0], dtype=int64)

**The predicted class of our arbitrary data point came out to be 0, which corresponds to Malignant.**
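When scaling a new sample, the scaler fitted on the original features should be reused through `transform`, with the sample shaped as (1, number of features). Refitting the scaler on the new sample alone would discard the training statistics. The sketch below illustrates this with a hypothetical sample (the first row of the dataset, slightly perturbed) and a model fitted on the full dataset for brevity. Both are assumptions of this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

dataset = load_breast_cancer()
scaler = StandardScaler().fit(dataset.data)
model = LogisticRegression().fit(scaler.transform(dataset.data), dataset.target)

# Hypothetical new measurement: the first instance with every value raised by 5%.
x_new = dataset.data[0] * 1.05

# Reuse the fitted scaler; reshape to one row with 30 columns.
x_new_scaled = scaler.transform(x_new.reshape(1, -1))
pred = int(model.predict(x_new_scaled)[0])
proba = model.predict_proba(x_new_scaled)[0]
print(dataset.target_names[pred], proba.round(3))

# Pitfall: refitting on the single sample maps every feature to 0.
refit = StandardScaler().fit_transform(x_new.reshape(1, -1))
print(np.all(refit == 0))
```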


In this way we can make predictions for new data using models trained on previous but similar data. Try it out yourself!