## Introduction: Logistic Regression for Cancer Diagnosis

Cancer remains one of the most formidable health challenges worldwide, affecting millions of lives annually. Timely and accurate diagnosis is paramount for effective treatment and improved patient outcomes. In the realm of medical diagnostics, machine learning algorithms play a crucial role in assisting healthcare professionals with early detection and classification of diseases, including cancer. 🩺🔬

Logistic regression, a fundamental machine learning technique, has garnered significant attention in medical research and clinical practice due to its simplicity, interpretability, and effectiveness in binary classification tasks. Despite its name, logistic regression is commonly used for classification rather than regression problems, making it a valuable tool in medical diagnosis. 💻📈

In this notebook, we explore the application of logistic regression specifically in the context of cancer diagnosis. We use a publicly available breast cancer dataset of 699 samples, each described by nine cell-level features scored from 1 to 10 (such as clump thickness and bare nuclei), to develop a predictive model capable of distinguishing between malignant and benign tumors. 🧬📊

Through this exploration, we aim to:
- Highlight the significance of machine learning in healthcare and cancer diagnosis.
- Demonstrate the practical implementation of logistic regression for binary classification tasks.
- Showcase the steps involved in data preprocessing, model training, evaluation, and interpretation.
- Provide insights into feature importance and model performance metrics relevant to cancer diagnosis. 📝👩‍⚕️

By understanding and implementing logistic regression for cancer diagnosis, we contribute to the ongoing efforts in leveraging data-driven approaches to enhance medical decision-making, improve patient care, and ultimately combat the global burden of cancer. 🌍💪

***Join us on this journey as we delve into the world of machine learning and its pivotal role in cancer diagnostics. 🚀***

### ***Logistic Regression theory recap***

![log_regression_model.png](attachment:4ef46d5d-f751-43e4-97f9-63b5332ade27.png)

#### ***Logistic regression is a statistical method used for binary classification tasks, where the goal is to predict the probability that an instance belongs to a particular class. Despite its name, it is a classifier: a linear combination of the features is passed through a logistic (sigmoid) function to produce a probability between 0 and 1.***

##### ***Here's a brief overview of how logistic regression works:***

1. **Model Representation**: In logistic regression, we represent the relationship between the independent variables $X$ and the binary dependent variable $Y$ using the logistic function. The logistic function is defined as:

   $P(Y=1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \ldots + \beta_n X_n)}}$

   Where:
   - $P(Y=1|X)$ is the probability that $Y$ is equal to 1 given $X$.
   - $\beta_0, \beta_1, \ldots, \beta_n$ are the coefficients of the model.
   - $X_1, X_2, \ldots, X_n$ are the independent variables.

2. **Training**: The goal of training logistic regression is to find the values of the coefficients $\beta_0, \beta_1, \ldots, \beta_n$ that best fit the training data. This is typically done by maximum likelihood estimation, which has no closed-form solution here and is therefore solved with an iterative optimizer such as gradient descent.

3. **Prediction**: Once the model is trained, it can be used to predict the probability of an instance belonging to a particular class. If the predicted probability is greater than a chosen threshold (e.g., 0.5), the instance is classified as belonging to that class; otherwise, it is classified as belonging to the other class.
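
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the solver scikit-learn uses: the function names, the learning rate, the iteration count, and the centred one-feature toy data are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, beta0, beta):
    """Step 1: P(Y=1 | X) for each row of X."""
    return sigmoid(beta0 + X @ beta)

def fit_gradient_descent(X, y, lr=0.1, n_iter=2000):
    """Step 2: fit the coefficients by batch gradient descent on the mean log loss."""
    beta0, beta = 0.0, np.zeros(X.shape[1])
    for _ in range(n_iter):
        # (p - y) is the gradient of the log loss with respect to the linear score.
        error = predict_proba(X, beta0, beta) - y
        beta0 -= lr * error.mean()
        beta -= lr * (X.T @ error) / len(y)
    return beta0, beta

# Made-up toy data: one centred feature, positive values belong to class 1.
X_toy = np.array([[-4.0], [-3.0], [-2.0], [-1.0], [1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

beta0, beta = fit_gradient_descent(X_toy, y_toy)
proba = predict_proba(X_toy, beta0, beta)
labels = (proba > 0.5).astype(int)  # Step 3: apply the 0.5 threshold

print(proba.round(3))
print(labels)
```

Lowering the threshold below 0.5 would flag more samples as class 1, which trades more false alarms for fewer missed positives.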

Mathematically, logistic regression minimizes a loss function called the logistic loss or cross-entropy loss. The loss function is defined as:

$\text{Loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$

Where:
- $N$ is the number of samples in the dataset.
- $y_i$ is the true label of the $i$-th sample (0 or 1).
- $\hat{y}_i$ is the predicted probability that the $i$-th sample belongs to class 1.
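
To make the formula concrete, here is a small sketch that evaluates it on made-up labels and probabilities. The helper name `log_loss_mean` and the clipping constant `eps` are assumptions for the example; clipping only guards against taking the logarithm of zero.

```python
import numpy as np

def log_loss_mean(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy, following the formula above."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log(0) never occurs for probabilities of exactly 0 or 1.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = [1, 0, 1, 0]
mostly_right = [0.9, 0.1, 0.8, 0.2]   # high probability on the true class
mostly_wrong = [0.1, 0.9, 0.2, 0.8]   # high probability on the wrong class

loss_right = log_loss_mean(y_true, mostly_right)
loss_wrong = log_loss_mean(y_true, mostly_wrong)
print(round(loss_right, 4))  # about 0.1643
print(round(loss_wrong, 4))  # about 1.9560
```

The loss grows quickly when the model is confident and wrong, which is why minimizing it pushes predicted probabilities toward the true labels.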

***Logistic regression is widely used in various fields due to its simplicity, interpretability, and effectiveness in binary classification tasks.***

![self-promotion.jpeg](attachment:53c83513-a1e2-4f83-8017-86343c90eae4.jpeg)

### ***A more detailed (and beginner-friendly) explanation, including a step-by-step implementation from scratch, can be found in the following notebook:***
***[Logistic Regression Demystified: Beginner's Guide](https://www.kaggle.com/code/daniilkrasnoproshin/logistic-regression-demystified-beginner-s-guide)***

### ***Import required dependencies***

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay, classification_report
from sklearn.linear_model import LogisticRegression

In [3]:
data = pd.read_csv('/content/breast_cancer_bd.csv')
data.head(15)

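| | Sample code number | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |
| 5 | 1017122 | 8 | 10 | 10 | 8 | 7 | 10 | 9 | 7 | 1 | 4 |
| 6 | 1018099 | 1 | 1 | 1 | 1 | 2 | 10 | 3 | 1 | 1 | 2 |
| 7 | 1018561 | 2 | 1 | 2 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 8 | 1033078 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 5 | 2 |
| 9 | 1033078 | 4 | 2 | 1 | 1 | 2 | 1 | 2 | 1 | 1 | 2 |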


In [4]:
data = data.drop('Sample code number', axis=1)

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 10 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Clump Thickness              699 non-null    int64 
 1   Uniformity of Cell Size      699 non-null    int64 
 2   Uniformity of Cell Shape     699 non-null    int64 
 3   Marginal Adhesion            699 non-null    int64 
 4   Single Epithelial Cell Size  699 non-null    int64 
 5   Bare Nuclei                  699 non-null    object
 6   Bland Chromatin              699 non-null    int64 
 7   Normal Nucleoli              699 non-null    int64 
 8   Mitoses                      699 non-null    int64 
 9   Class                        699 non-null    int64 
dtypes: int64(9), object(1)
memory usage: 54.7+ KB


In [6]:
data.describe()

| | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|
| count | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 |
| mean | 4.41774 | 3.134478 | 3.207439 | 2.806867 | 3.216023 | 3.437768 | 2.866953 | 1.589413 | 2.689557 |
| std | 2.815741 | 3.051459 | 2.971913 | 2.855379 | 2.2143 | 2.438364 | 3.053634 | 1.715078 | 0.951273 |
| min | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 |
| 25% | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 1.0 | 1.0 | 2.0 |
| 50% | 4.0 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 | 1.0 | 1.0 | 2.0 |
| 75% | 6.0 | 5.0 | 5.0 | 4.0 | 4.0 | 5.0 | 4.0 | 1.0 | 4.0 |
| max | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 4.0 |


In [7]:
print(data['Bare Nuclei'].unique())

['1' '10' '2' '4' '3' '9' '7' '?' '5' '8' '6']


In [8]:
# Alternative (not used here): drop the rows where 'Bare Nuclei' is '?' instead of imputing them
# data.drop(data.loc[data['Bare Nuclei']=='?'].index, inplace=True)

In [9]:
# '?' marks a missing value; turn it into NaN so the column can be parsed as numeric
data = data.replace('?', np.nan)

data['Bare Nuclei'] = pd.to_numeric(data['Bare Nuclei'])
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 10 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Clump Thickness              699 non-null    int64  
 1   Uniformity of Cell Size      699 non-null    int64  
 2   Uniformity of Cell Shape     699 non-null    int64  
 3   Marginal Adhesion            699 non-null    int64  
 4   Single Epithelial Cell Size  699 non-null    int64  
 5   Bare Nuclei                  683 non-null    float64
 6   Bland Chromatin              699 non-null    int64  
 7   Normal Nucleoli              699 non-null    int64  
 8   Mitoses                      699 non-null    int64  
 9   Class                        699 non-null    int64  
dtypes: float64(1), int64(9)
memory usage: 54.7 KB


In [10]:
# Fill the 16 missing 'Bare Nuclei' values (699 - 683) with the column median
data['Bare Nuclei'] = data['Bare Nuclei'].fillna(data['Bare Nuclei'].median())
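
As a worked example of the cleaning steps above, the sketch below applies them to a hypothetical six-row miniature of the column. It uses `pd.to_numeric(..., errors='coerce')`, which converts non-numeric entries such as `'?'` to NaN in one step; that is an alternative to the `replace` call used in the notebook, not what the notebook does.

```python
import pandas as pd

# Hypothetical miniature of the 'Bare Nuclei' column; '?' marks a missing value.
toy = pd.DataFrame({'Bare Nuclei': ['1', '10', '?', '2', '?', '4']})

# errors='coerce' turns anything that cannot be parsed as a number into NaN.
toy['Bare Nuclei'] = pd.to_numeric(toy['Bare Nuclei'], errors='coerce')
n_missing = int(toy['Bare Nuclei'].isna().sum())

# median() skips NaN, so it is computed from 1, 2, 4 and 10 only.
median_value = toy['Bare Nuclei'].median()
toy['Bare Nuclei'] = toy['Bare Nuclei'].fillna(median_value)

print(n_missing)                    # 2
print(median_value)                 # 3.0
print(toy['Bare Nuclei'].tolist())  # [1.0, 10.0, 3.0, 2.0, 3.0, 4.0]
```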

In [11]:
y = data['Class']
X = data.drop('Class', axis=1)
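
The theory recap uses labels 0 and 1, while the `Class` column holds 2 and 4. scikit-learn accepts arbitrary class labels, so remapping is optional, but the sketch below makes the link explicit. It assumes the coding 2 = benign and 4 = malignant, which the notebook does not state, and uses a hypothetical six-value stand-in for `y`. In the real data, the `Class` mean of about 2.69 reported by `describe()` implies that roughly 34% of the samples are labelled 4.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `y` (the real column has 699 entries).
y_toy = pd.Series([2, 2, 4, 2, 4, 2], name='Class')

# Assumed coding: 2 = benign -> 0, 4 = malignant -> 1.
y_binary = y_toy.map({2: 0, 4: 1})

print(y_binary.tolist())  # [0, 0, 1, 0, 1, 0]
print(y_binary.mean())    # share of samples labelled 4
```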

In [12]:
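# Hold out 30% of the samples for testing. No random_state is set, so the split
# (and the scores below) will differ slightly from run to run.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)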
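
The split above is unseeded, and the median used for imputation was computed on the full dataset before splitting. One common alternative, sketched here on synthetic stand-in data rather than the notebook's CSV, is to fix `random_state`, stratify on the label, and place the imputer inside a pipeline so that the median is learned from the training rows only. The feature names, the labelling rule, and the seed values are all made up for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data: 200 samples, 3 features scored 1 to 10.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.integers(1, 11, size=(200, 3)).astype(float),
                     columns=['f1', 'f2', 'f3'])
y_toy = np.where(X_toy.sum(axis=1) > 16, 4, 2)  # made-up rule giving labels 2 and 4
X_toy.iloc[::25, 1] = np.nan                    # inject a few missing values

# Seeded, stratified split: reproducible, with similar class shares in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy)

# The imputer is fitted on the training rows only when the pipeline is fitted.
model = make_pipeline(SimpleImputer(strategy='median'),
                      LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)
test_accuracy = model.score(X_te, y_te)
print(test_accuracy)
```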

In [13]:
lr_model = LogisticRegression()

lr_model.fit(X_train, y_train)

In [14]:
y_pred_lr = lr_model.predict(X_test)
score = accuracy_score(y_test,y_pred_lr)
print('Accuracy score: {}'.format(score))

Accuracy score: 0.9809523809523809
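
Accuracy alone does not show which kind of error a model makes, and in a diagnostic setting a missed malignant case is often costlier than a false alarm. The notebook imports `confusion_matrix` and `classification_report` but never calls them, so the sketch below shows how they, together with the fitted coefficients, could be inspected. It runs on synthetic stand-in data with a made-up labelling rule; the class names assume 2 = benign and 4 = malignant.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: the label depends on f1 and f2 only; f3 is pure noise.
rng = np.random.default_rng(0)
X_toy = rng.integers(1, 11, size=(300, 3)).astype(float)
y_toy = np.where(X_toy[:, 0] + X_toy[:, 1] > 11, 4, 2)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_te, y_hat, labels=[2, 4])
print(cm)
print(classification_report(y_te, y_hat, labels=[2, 4],
                            target_names=['benign (2)', 'malignant (4)']))

# One coefficient per feature; a positive value raises the predicted odds of class 4.
coefficients = dict(zip(['f1', 'f2', 'f3'], clf.coef_[0].round(2)))
print(coefficients)
```

Because the features here share the same 1 to 10 scale, the coefficient magnitudes are roughly comparable; with differently scaled features they would need standardizing first.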


#### ***Now let's compare logistic regression with SVM***

In [15]:
# gamma only affects the 'rbf', 'poly' and 'sigmoid' kernels, so it is omitted for the linear kernel
svc_model = SVC(C=0.1, kernel='linear')

svc_model.fit(X_train, y_train)

In [16]:
y_pred_svm = svc_model.predict(X_test)
score = accuracy_score(y_test,y_pred_svm)
print('Accuracy score: {}'.format(score))

Accuracy score: 0.9809523809523809
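
Both models reach the same score on this split (0.98095 is consistent with 206 correct predictions out of 210 test samples), so a single hold-out set cannot separate them. A hedged way to compare them more robustly is k-fold cross-validation, sketched below on synthetic stand-in data. The fold count, the seeds, and the noisy labelling rule are assumptions for the example, and the printed numbers say nothing about the notebook's dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data with a noisy, made-up labelling rule.
rng = np.random.default_rng(0)
X_toy = rng.integers(1, 11, size=(300, 4)).astype(float)
noise = rng.normal(0, 3, size=300)
y_toy = np.where(X_toy[:, 0] + X_toy[:, 1] + noise > 11, 4, 2)

# The same five stratified folds are used for both models.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
models = {
    'logistic regression': LogisticRegression(max_iter=1000),
    'linear SVM': SVC(C=0.1, kernel='linear'),
}
cv_scores = {name: cross_val_score(model, X_toy, y_toy, cv=cv, scoring='accuracy')
             for name, model in models.items()}

for name, scores in cv_scores.items():
    print(f'{name}: {scores.mean():.3f} +/- {scores.std():.3f}')
```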
