#### **Optional Question: Implement a Logistic Regression classifier on the breast cancer dataset**


**Name: Mubanga Nsofu** <br>
**Course: BAN6420, Module 5**<br>
**Email: mnsofu@learner.nexford.org** <br>
**Learner Id: 149050** <br>
**Institution: Nexford University**<br>
**Lecturer: Prof. R. Wanjiku** <br>
**Date: 3rd August 2024** <br>
**Task: Logistic Regression on Breast Cancer Data**


**1.0 Introduction**

*Logistic regression was developed in the 1940s as an alternative approach to Linear Discriminant Analysis (LDA). It is a statistical learning method applied to qualitative (categorical) response variables.* <br>
*A logistic regressor models the probability that the response Y belongs to a particular category given the predictors X, as shown below:* <br>

$p(X) = Pr(Y = 1|X)$ <br>

*It is used for binary classification problems (Hosmer, Lemeshow, & Sturdivant, 2013).
As the equation above shows, it models the probability of an outcome based on one or more predictor variables (Kleinbaum & Klein, 2010).
To this end, we implement a logistic regression classifier on the breast cancer dataset from sklearn, as requested in the assignment.*
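
To make the probability model concrete, here is a minimal sketch of the logistic (sigmoid) function that maps a linear combination of predictors to a value between 0 and 1. The coefficients `beta_0` and `beta_1` are hypothetical values chosen for illustration only; they are not fitted to the breast cancer data.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients for a single predictor (illustrative values only)
beta_0, beta_1 = -1.0, 2.0


def predict_proba(x: float) -> float:
    """p(X) = Pr(Y = 1 | X) for one predictor under the assumed coefficients."""
    return sigmoid(beta_0 + beta_1 * x)


for x in (-2.0, 0.5, 3.0):
    print(f"x = {x:+.1f} -> p(X) = {predict_proba(x):.3f}")
```

A larger value of the linear term pushes the probability towards 1, and a smaller one pushes it towards 0; at a linear term of exactly 0 the probability is 0.5.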

**2.0 Import the necessary libraries**

In [1]:
#  Let us import the necessary libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd
import numpy as np
import sweetviz as sv

# For reproducibility
np.random.seed(42)


**3.0 Let us load and preprocess the dataset**

In [3]:
# Load the Breast Cancer dataset
breast_cancer = datasets.load_breast_cancer()
X = breast_cancer.data
y = breast_cancer.target

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
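
Before naming the classes in any report, it can help to confirm how the dataset encodes them. The sketch below prints the name and sample count for each integer label using the `target_names` attribute of the loaded dataset; it reloads the data so that it runs on its own.

```python
import numpy as np
from sklearn import datasets

breast_cancer = datasets.load_breast_cancer()

# target_names[i] is the class name that belongs to integer label i
for label, name in enumerate(breast_cancer.target_names):
    count = int(np.sum(breast_cancer.target == label))
    print(f"label {label} = {name}: {count} samples")
```

Functions such as `classification_report` match the `target_names` argument to the labels in sorted order, so any hand-written list of names should follow the order printed here.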


**4.0 We split the dataset into training and test sets so the model can be evaluated on unseen data**

In [6]:
# Split the already-scaled data into training and testing sets (80/20); the random_state
# argument makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
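
Because the scaler above is fitted on the full dataset before the split, the test samples contribute to the scaling statistics. A possible alternative, sketched here under assumptions, is to split the raw features first and let a scikit-learn pipeline fit the scaler on the training portion only. The `stratify=y` and `max_iter=1000` arguments are additions that do not appear in the original code, and the resulting metrics may differ slightly from those reported in this notebook.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_breast_cancer(return_X_y=True)

# Split the raw features first; stratify keeps the class ratio similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The pipeline fits the scaler on the training data only, then reuses
# those statistics when transforming the test data
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

print(f"Test accuracy: {model.score(X_test, y_test):.2f}")
```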


**5.0 We Train the Logistic Regression Model**

In [None]:
# Let us instantiate the logistic regression model class from sklearn
logistic_regressor = LogisticRegression()

# Then we train the model using the training data
logistic_regressor.fit(X_train, y_train)
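
A fitted logistic regression model exposes the class probabilities behind its predictions. The sketch below is a self-contained illustration: it fits a separate model (`clf`, a name introduced here) on the full scaled dataset purely for brevity, so it is not a substitute for the train/test workflow in this notebook.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Fitted on all samples only to keep the illustration short
clf = LogisticRegression(max_iter=1000).fit(X_scaled, y)

# Each row holds [Pr(Y = 0 | X), Pr(Y = 1 | X)], in the order of clf.classes_
proba = clf.predict_proba(X_scaled[:5])
print(clf.classes_)
print(np.round(proba, 3))

# For two classes, predict() returns the class with the larger probability
labels = clf.classes_[np.argmax(proba, axis=1)]
print(labels, clf.predict(X_scaled[:5]))
```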


**6.0 Logistic Regression Model Evaluation**

In [8]:
# Predict on the test data
y_pred = logistic_regressor.predict(X_test)

# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

# Print the classification report.
# scikit-learn encodes this dataset as 0 = malignant and 1 = benign, and target_names
# are matched to the labels in that order
print(classification_report(y_test, y_pred, target_names=['Malignant', 'Benign']))


Accuracy: 0.97
              precision    recall  f1-score   support

   Malignant       0.98      0.95      0.96        43
      Benign       0.97      0.99      0.98        71

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114



**7.0 Results Interpretation**

*Each row of the report is computed by treating that class as the positive class.* <br>

**7.1. Precision is the ratio of correctly predicted positive observations (TP) to all predicted positives (TP + FP). Mathematically we can say:** <br>

$precision = TP/(TP+FP)$ <br>

*Where: TP = True Positives and FP = False Positives* <br>

*In the context of our results:* <br>
**Malignant:** Precision is 0.98, meaning 98% of the samples predicted as Malignant are actually Malignant. <br>
**Benign:** Precision is 0.97, meaning 97% of the samples predicted as Benign are actually Benign.

**7.2. Recall is the ratio of correctly predicted positive observations (TP) to all observations in the actual class (TP + FN); in other words, how well our logistic regressor finds the actual positives. Mathematically we can say:** <br>

$recall = TP/(TP+FN)$ <br>

*Where: TP = True Positives and FN = False Negatives* <br>

*In the context of our results:* <br>
**Malignant:** Recall is 0.95, meaning 95% of the actual Malignant samples were correctly identified by our model. <br>
**Benign:** Recall is 0.99, meaning 99% of the actual Benign samples were correctly identified by our model.

**7.3. F1-Score is the harmonic mean of precision and recall. It is important because it provides a single metric that balances the requirements of both. Mathematically we can say:** <br>

$F1 = 2*(precision*recall)/(precision+recall)$ <br>

*In the context of our results:* <br>
**Malignant:** F1-Score is 0.96, which indicates a good balance between precision and recall for Malignant samples. <br>
**Benign:** F1-Score is 0.98, which indicates a good balance between precision and recall for Benign samples.

**7.4. Support is the actual number of occurrences of each class in the evaluated data. It shows how many samples there are for each class.** <br>

*In the context of our results:* <br>
**Malignant:** Support is 43, meaning there are 43 actual Malignant samples in the test set. <br>
**Benign:** Support is 71, meaning there are 71 actual Benign samples in the test set. <br>

The support values show a moderate class imbalance (43 vs. 71 samples); however, the model performs well on both classes, as the precision, recall, and F1-score show. These per-class metrics act as a safety net compared with looking at model accuracy alone.
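
The formulas in this section can be checked by hand with a small worked example. The counts below (`tp`, `fp`, `fn`) are hypothetical values for a single class and are not taken from the model above.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F1 for one class from its confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for one class (illustrative values only)
tp, fp, fn = 8, 2, 4

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
support = tp + fn  # number of actual samples of this class

print(f"precision = {precision:.2f}")
print(f"recall    = {recall:.2f}")
print(f"f1-score  = {f1:.2f}")
print(f"support   = {support}")
```

With these counts, precision is 8/10, recall is 8/12, and the F1-score falls between the two, closer to the smaller value, which is the behaviour expected of a harmonic mean.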

**8.0 References**

Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied logistic regression (3rd ed.). Wiley. https://doi.org/10.1002/9781118548387 <br>

Kleinbaum, D. G., & Klein, M. (2010). Logistic regression: A self-learning text (3rd ed.). Springer. https://doi.org/10.1007/978-1-4419-1742- <br>

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press. http://www.deeplearningbook.org

