# Predicting Cancer Diagnosis Using Binary Logistic Regression

### Perspective:

* This notebook applies data analytics and machine learning to predict, from a cancer diagnostic dataset, how likely a patient is
  to have cancer.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('cancer.csv')
df.sample(7)

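| | id | Clump Thickness | UofCSize | UofCShape | Marginal Adhesion | SECSize | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 169 | 1200952 | 5 | 8 | 7 | 7 | 10 | 10 | 5 | 7 | 1 | 4 |
| 323 | 798429 | 1 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 65 | 1118039 | 5 | 3 | 4 | 1 | 8 | 10 | 4 | 9 | 1 | 4 |
| 21 | 1054593 | 10 | 5 | 5 | 3 | 6 | 7 | 7 | 10 | 1 | 4 |
| 490 | 1277268 | 3 | 3 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 2 |
| 518 | 1043068 | 3 | 1 | 1 | 1 | 2 | 1 | 2 | 1 | 1 | 2 |
| 103 | 1169049 | 7 | 3 | 4 | 4 | 3 | 3 | 3 | 2 | 7 | 4 |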


In [7]:
df.columns

Index(['id', 'Clump Thickness', 'UofCSize', 'UofCShape', 'Marginal Adhesion',
       'SECSize', 'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli',
       'Mitoses', 'Class'],
      dtype='object')

- Let us explore the columns:
  1. **id:** A unique identifier for each patient or sample. It carries no predictive information, so we will drop it during data preprocessing.
  2. **Clump Thickness:** Measures the thickness of cell clumps. Higher values may indicate a higher likelihood of malignancy.
  3. **UofCSize:** Short for **Uniformity of Cell Size**. A measure of how consistent the size of the cells is. Cancerous cells tend to
     vary more in size than non-cancerous cells.
  4. **UofCShape:** Short for **Uniformity of Cell Shape**. A measure of how consistent the shape of the cells is. Malignant cells often have irregular
     shapes.
  5. **Marginal Adhesion:** Assesses how well the cells stick to each other. Cancerous cells often have reduced adhesion, so higher values may indicate
     malignancy.
  6. **SECSize:** Short for **Single Epithelial Cell Size**. Measures the size of single epithelial cells. A larger size may indicate malignancy.
  7. **Bare Nuclei:** Counts the nuclei that appear bare, that is, without surrounding cytoplasm. It is a strong indicator of malignancy.
  8. **Bland Chromatin:** Describes the texture or appearance of the cell's chromatin under a microscope. Cancerous cells may show an abnormal chromatin
     texture.
  9. **Normal Nucleoli:** Measures the number of nucleoli present in the cells. Malignant cells often have prominent or multiple nucleoli.
  10. **Mitoses:** Counts the mitotic figures, which are cells in the process of dividing. Higher mitotic rates often indicate malignancy.
  11. **Class:** The target variable indicating whether the tumor is benign or malignant. The raw column uses the codes 2 and 4; we will convert it to
      categorical codes so that 0 represents benign and 1 represents malignant, as is usual in binary classification.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 683 entries, 0 to 682
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   id                 683 non-null    int64
 1   Clump Thickness    683 non-null    int64
 2   UofCSize           683 non-null    int64
 3   UofCShape          683 non-null    int64
 4   Marginal Adhesion  683 non-null    int64
 5   SECSize            683 non-null    int64
 6   Bare Nuclei        683 non-null    int64
 7   Bland Chromatin    683 non-null    int64
 8   Normal Nucleoli    683 non-null    int64
 9   Mitoses            683 non-null    int64
 10  Class              683 non-null    int64
dtypes: int64(11)
memory usage: 58.8 KB


In [6]:
df.isnull().any()
#print('There are no missing values.')

id                   False
Clump Thickness      False
UofCSize             False
UofCShape            False
Marginal Adhesion    False
SECSize              False
Bare Nuclei          False
Bland Chromatin      False
Normal Nucleoli      False
Mitoses              False
Class                False
dtype: bool

In [8]:
df["Class"] = df["Class"].astype('category')
df.dtypes

id                      int64
Clump Thickness         int64
UofCSize                int64
UofCShape               int64
Marginal Adhesion       int64
SECSize                 int64
Bare Nuclei             int64
Bland Chromatin         int64
Normal Nucleoli         int64
Mitoses                 int64
Class                category
dtype: object

In [9]:
df["Class"] = df["Class"].cat.codes
df.head(10)

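| | id | Clump Thickness | UofCSize | UofCShape | Marginal Adhesion | SECSize | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 0 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 0 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 0 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 0 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 0 |
| 5 | 1017122 | 8 | 10 | 10 | 8 | 7 | 10 | 9 | 7 | 1 | 1 |
| 6 | 1018099 | 1 | 1 | 1 | 1 | 2 | 10 | 3 | 1 | 1 | 0 |
| 7 | 1018561 | 2 | 1 | 2 | 1 | 2 | 1 | 3 | 1 | 1 | 0 |
| 8 | 1033078 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 5 | 0 |
| 9 | 1033078 | 4 | 2 | 1 | 1 | 2 | 1 | 2 | 1 | 1 | 0 |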
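
The `cat.codes` step above works because pandas orders the categories ascending, so the smaller raw label becomes 0 and the larger becomes 1. A minimal sketch of an equivalent but more explicit mapping follows. It assumes the raw `Class` column holds only the values 2 (benign) and 4 (malignant), as in the sampled rows; the toy series and the name `label_map` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the raw Class column (assumed values: 2 = benign, 4 = malignant).
raw_class = pd.Series([2, 4, 2, 2, 4], name="Class")

# Explicit mapping: the meaning of each code is visible in the code itself.
label_map = {2: 0, 4: 1}  # hypothetical helper name
explicit = raw_class.map(label_map)

# The notebook's approach: category codes follow the sorted category order.
via_codes = raw_class.astype("category").cat.codes

print(explicit.tolist())
print(via_codes.tolist())
```

Both approaches give the same labels here. The explicit dictionary has the advantage that an unexpected raw value shows up as `NaN` instead of silently receiving a new code.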


## How the features relate to Cancer Diagnostics:

* Features like Bare Nuclei, Clump Thickness, UofCSize, and UofCShape are typically the most predictive.
* Features like Marginal Adhesion and Mitoses add supplementary information that can improve model accuracy.
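
One simple way to probe claims like these is to look at the correlation between each feature and `Class`. The sketch below does this on the ten rows printed by `df.head(10)` only, so that it runs without `cancer.csv`. Ten rows with a single malignant case are far too few to draw conclusions; on the full data the same idea would be `df.drop(columns='id').corr()['Class']`. The names `sample` and `corr_with_class` are illustrative.

```python
import pandas as pd

# The ten rows printed by df.head(10), after Class was recoded to 0/1.
columns = ["Clump Thickness", "UofCSize", "UofCShape", "Marginal Adhesion",
           "SECSize", "Bare Nuclei", "Bland Chromatin", "Normal Nucleoli",
           "Mitoses", "Class"]
rows = [
    [5, 1, 1, 1, 2, 1, 3, 1, 1, 0],
    [5, 4, 4, 5, 7, 10, 3, 2, 1, 0],
    [3, 1, 1, 1, 2, 2, 3, 1, 1, 0],
    [6, 8, 8, 1, 3, 4, 3, 7, 1, 0],
    [4, 1, 1, 3, 2, 1, 3, 1, 1, 0],
    [8, 10, 10, 8, 7, 10, 9, 7, 1, 1],
    [1, 1, 1, 1, 2, 10, 3, 1, 1, 0],
    [2, 1, 2, 1, 2, 1, 3, 1, 1, 0],
    [2, 1, 1, 1, 2, 1, 1, 1, 5, 0],
    [4, 2, 1, 1, 2, 1, 2, 1, 1, 0],
]
sample = pd.DataFrame(rows, columns=columns)

# Pearson correlation of each feature with the target, strongest first.
corr_with_class = (
    sample.corr()["Class"].drop("Class").sort_values(ascending=False)
)
print(corr_with_class.round(2))
```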

In [10]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Plot styling
plt.rc("font", size=14)
sns.set(style="whitegrid", color_codes=True)

## Splitting the data into Features (X) and Target (y)

In [11]:
# Features: every column except the identifier and the target
X = df.drop(columns=['id', 'Class'])
X.head()

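| | Clump Thickness | UofCSize | UofCShape | Marginal Adhesion | SECSize | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 |
| 1 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 |
| 2 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 |
| 3 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 |
| 4 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 |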


In [12]:
y = pd.DataFrame(df.iloc[:,-1])
y.head()

| | Class |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |


## Logistic Regression model fitting

In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hold out 30% of the rows for testing; the fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

logreg = LogisticRegression()
# y_train is a one-column DataFrame; flatten it to the 1-D array that fit() expects
logreg.fit(X_train, y_train.values.ravel())
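
A fitted binary logistic regression turns a weighted sum of the features into a probability through the sigmoid function, and predicts class 1 when the weighted sum is positive (probability above 0.5). The sketch below illustrates this on synthetic data from `make_classification`, which is an assumption standing in for the cancer dataset; the names `X_demo`, `y_demo`, and `model` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the cancer data: 9 numeric features, binary target.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=9, n_informative=5, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Linear score z = w . x + b for each sample.
z = X_demo @ model.coef_.ravel() + model.intercept_[0]

# The sigmoid squashes the score into a probability of class 1.
manual_proba = 1.0 / (1.0 + np.exp(-z))

# A positive score corresponds to a probability above 0.5, i.e. class 1.
manual_pred = (z > 0).astype(int)

print(np.allclose(manual_proba, model.predict_proba(X_demo)[:, 1]))
print((manual_pred == model.predict(X_demo)).all())
```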


### Predicting the test set results and calculating the accuracy

In [14]:
y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))

Accuracy of logistic regression classifier on test set: 0.96


* The model reaches an accuracy of 96% on the 205 held-out test samples, which suggests that it captures the patterns
  separating benign from malignant cases well.

### Confusion Matrix

In [15]:
from sklearn.metrics import confusion_matrix

# Use a separate name so the imported confusion_matrix function is not overwritten
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[126   4]
 [  5  70]]


- Let's interpret the confusion matrix:

  - **Row 1:** Patients who are actually cancer-free.
  - **Row 2:** Patients who actually have cancer.
  - **Column 1:** Predicted as cancer-free.
  - **Column 2:** Predicted as having cancer.

**True positive (TP) = <u>70</u>**:
- These are patients who actually have cancer and were correctly predicted by the model as having cancer.

**False negative (FN) = <u>5</u>**:
- These are patients who actually have cancer but were incorrectly predicted by the model as cancer-free.

**False positive (FP) = <u>4</u>**:
- These are patients who are actually cancer-free but were incorrectly predicted by the model as having cancer.

**True negative (TN) = <u>126</u>**:
- These are patients who are actually cancer-free and were correctly predicted as cancer-free by the model.
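
The four counts are enough to recompute the headline metrics by hand. The following sketch uses only the numbers from the confusion matrix above; for a binary problem, scikit-learn orders the flattened matrix as TN, FP, FN, TP.

```python
# Counts read from the confusion matrix [[126, 4], [5, 70]].
tn, fp, fn, tp = 126, 4, 5, 70

total = tn + fp + fn + tp                # 205 test samples
accuracy = (tp + tn) / total             # share of all predictions that are correct
recall = tp / (tp + fn)                  # sensitivity: share of cancer cases caught
specificity = tn / (tn + fp)             # share of cancer-free cases correctly cleared
precision = tp / (tp + fp)               # share of predicted cancer cases that are real
f1 = 2 * precision * recall / (precision + recall)

for name, value in [("accuracy", accuracy), ("recall", recall),
                    ("specificity", specificity), ("precision", precision),
                    ("f1", f1)]:
    print(f"{name:12s}{value:.2f}")
```

Rounded to two decimals, this gives an accuracy of 0.96, a recall of 0.93, a specificity of 0.97, a precision of 0.95, and an F1 score of 0.94 for the cancer class.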

In [16]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.96      0.97      0.97       130
           1       0.95      0.93      0.94        75

    accuracy                           0.96       205
   macro avg       0.95      0.95      0.95       205
weighted avg       0.96      0.96      0.96       205
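
The `macro avg` and `weighted avg` rows differ only in how the per-class scores are combined. A small worked example for precision, using the counts from the confusion matrix `[[126, 4], [5, 70]]` (the variable names are illustrative):

```python
# Per-class counts from the confusion matrix [[126, 4], [5, 70]].
support = {0: 126 + 4, 1: 5 + 70}        # actual samples per class
precision = {0: 126 / (126 + 5), 1: 70 / (70 + 4)}

n = sum(support.values())

# Macro average: plain mean over classes, so each class counts equally.
macro_precision = sum(precision.values()) / len(precision)

# Weighted average: each class contributes in proportion to its support.
weighted_precision = sum(precision[c] * support[c] for c in support) / n

print(f"macro    {macro_precision:.2f}")
print(f"weighted {weighted_precision:.2f}")
```

Because the cancer-free class is larger and has the slightly higher precision, the weighted average (0.96) sits a little above the macro average (0.95).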



- From the classification report we can conclude that the model performs well overall, with:

  1. **Strong cancer detection (high recall):** Only 5 of the 75 cancer patients were missed.
  2. **Few false positives (high specificity):** Only 4 of the 130 healthy individuals were misdiagnosed as having cancer.
  3. **Balanced performance:** High precision and recall together give reliable cancer detection while limiting unnecessary
     alarms.

- This makes our model a useful diagnostic tool, but reducing false negatives (FN) is critical, since missing cancer cases can have severe consequences.
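
One common way to trade false negatives against false positives is to lower the probability threshold at which a sample is labeled as cancer. The sketch below is an illustration under stated assumptions, not a tuned procedure: it uses synthetic data from `make_classification` in place of the cancer dataset, the threshold 0.3 is arbitrary, and `predict_at` is a hypothetical helper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cancer data (assumption: not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=9, n_informative=5, weights=[0.65, 0.35],
    random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba_pos = model.predict_proba(X_te)[:, 1]   # P(class 1) per test sample

def predict_at(threshold):
    """Label a sample positive when P(class 1) reaches the threshold."""
    return (proba_pos >= threshold).astype(int)

for threshold in (0.5, 0.3):                  # 0.3 is an arbitrary illustration
    y_hat = predict_at(threshold)
    false_pos = int(((y_hat == 1) & (y_te == 0)).sum())
    print(f"threshold={threshold}: recall={recall_score(y_te, y_hat):.2f}, "
          f"false positives={false_pos}")
```

Lowering the threshold can only keep or increase the number of samples flagged as positive, so recall never drops, while false positives may rise. Where to set it depends on how costly a missed case is relative to a false alarm.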
       